Abstract

Breakage-Fusion-Bridge cycles in cancer arise when a broken segment of DNA is
duplicated and an end from each copy joined together. This structure then 'unfolds'
into a new piece of palindromic DNA. This is one mechanism responsible for the
localised amplicons observed in cancer genome data. The process has parallels with
paper folding sequences that arise when a piece of paper is folded several times and
then unfolded. Here we adapt such methods to study the breakage-fusion-bridge
structures in detail. We firstly consider discrete representations of this space with 2-d
trees to demonstrate that there are $2^{n(n-1)/2}$ qualitatively distinct evolutions involving
$n$ breakage-fusion-bridge cycles. Secondly we consider the stochastic nature of the
fold positions, to determine evolution likelihoods, and also describe how amplicons
become localised. Finally we highlight these methods by inferring the evolution of
breakage-fusion-bridge cycles with data from primary tissue cancer samples.


1 Introduction

Breakage-Fusion-Bridge (BFB) cycles potentially arise whenever a stretch of DNA is
broken and a cell division cycle takes place. The first stage in this division process is
DNA replication, where duplication will take place up to the DNA break, leaving two
exposed ends. Prior to cell division, the cell's checkpoint machinery will look for mistakes
and the two exposed ends may be erroneously joined together by double-stranded break
repair. This results in a palindromic sequence of DNA, often containing a duplicated
centromere (see Figure 1A). Spindles then attach to centromeres, which then contract
during cell division to pull a chromosome into each daughter cell. However, if each
centromere of this dicentric chromosome is to successfully relocate to distinct daughter
cells, the DNA between the centromeres has to snap, and so each daughter cell will
have a centromere on a DNA segment with an exposed (broken) end. This process
of duplication, end repair and DNA breakage can then continue with each cell
division, and the process repeats itself in a cascade of BFB cycles.

The process is unlikely to continue indefinitely because eventually repair machinery
will attach exposed ends to other portions of the genome to produce a translocation, for
example, or telomerase may cap the end to produce a somatic telomere. However, this
process of repeatedly duplicating, repairing and unfolding is known to produce complex
rearrangements of the original genomic assembly, which are frequently observed in cancer
genomes [1], [3]. The complexities of these rearrangements have also been examined
algebraically in [6].

These genomic rearrangements closely resemble paper folding operations in origami 
where paper is repeatedly folded in upward and downward directions. When the paper 
is unraveled, we obtain a series of troughs and peaks which can be represented as a 
binary sequence. These sequences can be recursively generated and serve as examples 
of automaton pQ, which gives us a starting point to represent BFB processes. 

There are some key differences to note, however. Whilst paper is intrinsically the
same material at all positions, DNA is composed of a variable sequence of nucleotides
and consequently is identifiably unique along its length prior to the BFB duplication
process (DNA repeats are ignored). We can thus label each point along the original
DNA sequence with its genomic position and consider how these labels are duplicated
in the BFB process. By comparing the positions on the BFB product with the original
reference sequence, we can fold the BFB product so the same labels (i.e. reference
positions) are vertically aligned, such as in Figure 1B, where three folds are required.
Note that these folds are located at precisely the two reference positions of DNA repair
in the BFB cycles. The term fold and the folded structure relative to the reference will
be used in the majority of the work. We will also refer to the stretch of DNA between
two consecutive folds as a segment.

This representation mirrors that observed experimentally. In Figure 1C, for example,
we have the results of some next generation sequencing. The horizontal axis represents
the reference position, and the vertical axis the experimental signal (sequence read depth
[12], [13]). We see this signal is relatively constant within each of six regions I, II, ..., VI.
The changes in signal where regions meet also coincide with positions of DNA folds
detected with next generation sequencing; the green and red vertical lines indicate
folds pointing left and right, respectively. Collectively, these results are indicative of a
sequence of BFB cycles, and we will later infer the likely underlying folding structure,
the prediction indicated in Figure 1D.

This inference relies in part on the linear relationship between the experimental
signal and the number of copies of DNA 'folded' across a given reference region, the
copy number profile, summarized in Figure 1E. The top copy number vector cn counts
the number of copies of each region in the final structure, and fold number vector fn
counts the number of copies of each fold. For this prediction, we see that region II (with
the highest signal) has a predicted copy number of 18; 16 copies in the BFB structure
and 2 from other copies of the original chromosome. The fold numbers at the left and
right side of this region are 2 and 8, respectively, reflecting the number of times DNA
reverses direction relative to the reference. Note that as we move from region I to II,
the copy number changes by 16, and the fold number is half of this, 8. This is because
each fold accounts for two genomic copies.

Figure 1: The BFB Process. A) A representation of a chromosome, the circle being
a centromere, the red and yellow markers hypothetical genes duplicated and deleted
throughout this process. A DNA break (at position of orange triangles) followed by
duplication and repair results in a palindromic chromosome with two centromeres. Spindles
grab each centromere and contract during cell division, resulting in another break, and
the cycle continues. B) The BFB product (*) is folded relative to the original reference
genome. C) A typical amplicon formed through a BFB process. The horizontal axis
is genomic position, the vertical axis is read depth. Vertical lines indicate detected
BFBs. D) The predicted BFB folding pattern. E) The copy number profile; cn counts
the number of genomic copies in each region and fn counts the number of folds at each
fold locus.

These data constitute an example of an 'amplicon', a structure frequently observed
in cancer genome data. Amplicons are clusters of rearrangements with a high signal in the
reference genome, indicating an abnormally large number of copies are present in the
cancer genome. These 'amplified' regions are usually restricted to a few megabases
of DNA, a small proportion of a typical human chromosome. The BFB process is
one mechanism by which these events can arise [1], [3]. Next generation sequencing
technologies mean we can now visualize these events in great detail, producing extensive
catalogs of the mutations involved [12], [13], from which the etiology of these events can
then be investigated [1], [15].

In this work we consider several interesting questions that naturally arise from the
processes we are considering and the data they produce. Firstly, consider the problem
of how best to represent this process. It is both discrete, in terms of the type of
folded structures that can arise, and continuous, in terms of the reference positions of
the folds. By introducing a discrete representation of the BFB process, we provide a
coherent representation of the genomic conformations that can arise in BFB 'space'.
Furthermore, this structure allows us to measure the size of this space, proving that
there are $2^{n(n-1)/2}$ qualitatively distinct evolutions given $n$ BFB cycles. We also model
the process stochastically. This allows us to reconstruct the most probable evolution
of BFB processes underlying any observed amplicon. Furthermore, this provides some
understanding of why amplicons are so localised in the genome.

The paper is arranged as follows. The next section considers how to discretely 
represent the process as an iteration on algebraic words, and how this iteration can be 
inverted, extending some ideas in [6]. This enables the BFB process to be identified
from the final DNA sequence (but not the copy number profile, which contains less 
information). We then introduce the more compact BFB sequence which is a discrete 
way of labeling the BFB process. This is non-unique, and each BFB sequence produces 
a partially ordered set (poset) of possible BFB processes, each with a range of possible 
copy number profiles. The subsequent section introduces methods to produce and count 
the BFBs contained within each poset and hence determine how the size of the space 
of BFB products grows with the number of BFB cycles that have taken place. We 
then consider the stochastic nature of the fold positions, and the impact this has on 
the likelihood of observing possible BFB structures. Applications of these methods and 
the difficulties of dealing with real data are then explored in more detail. Concluding 
remarks complete the paper. All proofs can be found in the Appendix. 
2 Word Representations of BFB Processes



We now consider algebraic representations of the BFB process, both in a forward
direction, reflecting the evolution of the BFB process, and backward, indicating how to
unravel a BFB folded structure, reverse the process, and infer the events that have taken
place.

2.1 The Forward Process 

The BFB process can be described as an iteration scheme on a word of symbols. This 
follows ideas from paper folding sequences, where the binary letters of words represent 
peaks and troughs that run along an unfolded piece of paper formed from a series of 
folding operations. These words can be constructed by an iteration of word operators 
each of which depend upon whether the folding action was up or down pQ . 

For BFBs we have to generalize this somewhat. Firstly, binary sequences prove 
inadequate, and there are two symbolic 'languages' that we draw from; one is in terms 
of the (reference) positions of folds, the other is in terms of the regions of the reference 
genome that bridge these positions. The latter representation was considered in [6]. 

Consider the example in Figure 2. Here a segment of DNA undergoes a series of
five BFB cycles (first column). This results in five fold positions which partition the
reference genome into six regions, which we label A, B, C, D, E and F, our first language.
These are initially contiguous and we represent this as the region word ABCDEF, as
given in the second column. Our second language utilizes the reference positions of BFB
folds, labeled 0, 1, ..., 6, which includes the ends of the original structure. Because no
BFBs have yet taken place, we have the trivial 0-fold word 06, representing the ends
of the region, as indicated in the third column. The '0-' indicates no folds have yet
occurred.

We now implement the BFB process. Firstly, we have a break in our segment at
the 1st fold position, $x_1$, separating regions E and F. We suppose the DNA to the
left of the break is duplicated and the right side is lost to a different daughter cell.
The two duplicated ends at position $x_1$ are then stitched together to form one new
structure. If we read the regions traversed through this new structure, we have word
$ABCDEE^{-1}D^{-1}C^{-1}B^{-1}A^{-1}$, where a negative power indicates the region is in a
reversed direction. Note that the sub-word $..EE^{-1}..$ essentially demarcates the right
pointing fold at this position. Note also that the word is pseudo-palindromic; the letters are
palindromic and the signs on the right hand side are opposite to those on the left. If
we reverse the order of the symbols and change their signs we find the word is equal to
its inverse, $W^{-1} = W$, which reflects the chromosome symmetry; if we turn the
chromosome upside down, we end up with an identical structure. In the second language of
folds we traverse the structure from one end reporting fold numbers when we reach a
fold; this gives the simpler 1-fold word 010. Unlike the region word representation, we
do not have signs to indicate direction, and note that as we traverse the structure, the
folds alternate the direction they point. We will see below that both representations are
equivalent.
[Figure 2 graphic omitted. Fold words legible in the extracted panel: 06, 01210,
012131210, 0124210 and 0124215124210.]

Figure 2: The first column indicates a sequence of five BFBs. The second and third
columns give region and fold word representations. The fourth column indexes the
BFB. The fifth column indicates the BFB sequence. The sixth and seventh columns are
the cumulative representative sequence and the directional signature. The final column
provides inequalities satisfied by the fold positions.



We continue the process, the next fold occurring at position $x_2$, resulting in a
structure with three folds and four segments, with corresponding word
$ABCDEE^{-1}D^{-1}C^{-1}B^{-1}BCDEE^{-1}D^{-1}C^{-1}B^{-1}A^{-1}$. This time the fold points to the
left and the subword $..B^{-1}B..$ represents the fold between regions A and B. In terms of
folds we now have 2-fold word 01210.

Note that we have two choices to construct the fold at reference position $x_2$.
Considering the second structure formed, we can either break at the upper copy of $x_2$ and
duplicate the DNA below, or break at the lower copy and duplicate the DNA above; the
same product results. This symmetry is true in general, where we have the following.

Lemma 2.1. If a BFB fold is positioned a length $l$ from one end of a BFB product,
with the upper portion duplicated, the BFB product cannot be distinguished from that
arising when a BFB fold is positioned a length $l$ from the other end with the lower portion
duplicated. □

The palindromic nature of BFB products then means we always have two choices to 
place the fold position for any given product. In what follows, we always assume that we 
are duplicating DNA above the position of the fold, with respect to the representations 
drawn in Figure 2. Note that this only applies to a palindromic product and so does not
apply to the very first BFB event, for which every fold position and duplication gives a 
unique product. 

We now continue the process, iteratively building up the word. The next new fold is
at position $x_3$, occurring after the third fold of the third product. We thus keep the word
containing the first three folds 0121, insert the new fold 3, and add the first three folds
in reverse order 1210, resulting in 3-fold word 012131210. This fold is then deleted by
the fourth BFB; the corresponding fold is positioned at reference position $x_4$, occurring
immediately after the second fold in the structure, and we obtain 4-fold word 0124210,
losing symbol 3. Although five folds take place, the final conformation 0124215124210
then only contains four fold numbers 1, 2, 4 and 5. This can happen generally: if the fold
occurs in the upper half we lose the middle position, which must be the previous BFB
location, and information is lost. Furthermore, if we simply implement the BFBs 1, 2, 4
and 5 we get the same product. We refer to this as a reduced BFB set.

Note also that in our example the first BFB involved the loss of the segment to
the right of $x_1$, resulting in two ends extending in a leftward direction relative to the
reference. We have lost all copies of the rightwards ends after this step, and every
subsequent BFB results in two ends that always point in a leftward direction. In this
sense the direction of the first BFB event is special. We refer to this direction as the
BFB parity $p$, where $p = 1$ (resp. $-1$) indicates the ends extend to the right (resp. left),
relative to the reference.

We summarize the word representations of a BFB process as follows. 

Lemma 2.2. The region word $W_i$ representation of the $i$th step of $n$ BFB processes is
constructed by taking the initial word $W_1 = S_1 S_2 \cdots S_n S_n^{-1} \cdots S_1^{-1}$ for a parity $-1$ BFB,
or $W_1 = S_{n+1}^{-1} S_n^{-1} \cdots S_2^{-1} S_2 \cdots S_{n+1}$ for parity 1, and constructing the word $W_i$ recursively
from $W_i = W_{i-1}(a_i)\,W_{i-1}(a_i)^{-1}$, where $a_i$ is the number of regions from the upper end
of the $(i-1)$th BFB product that are replicated, and $W(j)$ truncates word $W$ to the first
$j$ symbols.

For the fold word we have $W_1 = 010$ and recursion $W_i = W_{i-1}(b_i)\,i\,W_{i-1}(b_i)^{-1}$, where
$b_i - 1$ is the number of duplicated folds. □
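
As an illustration of the fold word recursion in Lemma 2.2, the following Python sketch builds fold words as lists of integers, starting from $W_1 = 010$. The function name `fold_word` and the argument `truncations` (holding $b_2, b_3, \ldots$) are our own labels and not part of the original notation; the inverse of a fold word is taken to be its reversal, since fold words carry no signs.

```python
def fold_word(truncations):
    """Build a fold word via W_i = W_{i-1}(b_i) i W_{i-1}(b_i)^{-1}.

    `truncations` lists b_2, b_3, ... (hypothetical argument name).
    Words are lists of fold labels; label 0 marks the retained end.
    """
    word = [0, 1, 0]  # W_1 = 010
    for i, b in enumerate(truncations, start=2):
        # The new fold must sit on an existing segment of the current word.
        if not 1 <= b <= len(word) - 1:
            raise ValueError(f"b_{i} = {b} is outside the current structure")
        prefix = word[:b]                      # W_{i-1}(b_i): first b_i symbols
        word = prefix + [i] + prefix[::-1]     # append new fold, then the reversed prefix
    return word


# The worked example of Figure 2 uses b = 2, 4, 3, 6 for BFBs 2 to 5.
example = fold_word([2, 4, 3, 6])
print("".join(map(str, example)))
```

With the truncation lengths read off the worked example, the sketch reproduces the intermediate words 01210, 012131210 and 0124210 before arriving at the final word.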

This gives us two representations for BFB structures. The region word is somewhat
more cumbersome, but gives a readout of consecutive reference regions in the structure
and as such provides a contig of the underlying chromosome. Note also that counts of
individual letters in the word provide the vector cn of the copy number profile. The
fold word is the more efficient representation. Counts of distinct symbols in this word
provide the counts fn of the copy number profile. For example, the words associated
with the final structure in Figure 2 have ten occurrences of B, the number of copies of
the second region, and four occurrences of 1, the number of folds at position $x_1$. Not
all integer sequences can arise as a BFB copy number profile; the range of possible copy
number profiles cn has been explored in detail in [6].
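
The fold counts can be read directly off a fold word. A minimal sketch, assuming the final fold word of the worked example and treating label 0 as the retained end rather than a fold (the variable names are ours):

```python
from collections import Counter

# Final fold word of the worked example, 0124215124210, as a list of labels.
final_word = [0, 1, 2, 4, 2, 1, 5, 1, 2, 4, 2, 1, 0]

# Count copies of each fold; label 0 is the end of the structure, not a fold.
fold_counts = Counter(label for label in final_word if label != 0)

for fold in sorted(fold_counts):
    print(f"fold {fold}: {fold_counts[fold]} copies")
```

Fold 3 is absent from the counts because it was deleted by the fourth BFB.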

2.2 The Reverse Process 

We now consider the opposite problem; given a BFB word, we need to reverse the
process and identify the events that have taken place. This represents the typical inference
problem in genomics, where we have the final structure of a genome and wish to reverse
engineer the process to identify the evolutionary history. This can be achieved by
identifying the unique BFB fold that demarcates the centre of the palindromic structure and
undoing the duplication. For example, the BFB sequence of Figure 2 resulted in fold
word W = 0124215124210. The centre fold 5 is undone, leaving 012421. This must arise
from palindrome 0124210 so we undo 4 to leave 012, which must derive from palindrome
01210. Undoing 2 and then 1 completes the sequence and the evolution of folds is 1, 2, 4
and then 5. Note that we have reconstructed the reduced BFB set, not the full list; fold
3, which was deleted by fold 4, is not included. Note also that 1, 2, 4, 5 is precisely the
order that the symbols first appear in the final word W = 0124215124210.
In general we have the following result (see Appendix).

Theorem 2.1. A fold word is a viable representation of a BFB process if and only if
it can be reversed with the following algorithm. This identifies the unique reduced BFB
sequence responsible for the word.

STEP 1: Take palindromic fold word $W = X n X^{-1}$ and undo fold $n$ to construct $X$.
Output $n$.

STEP 2: Identify the rightmost uniquely occurring symbol $m$ such that the fold word
is $Z Y m Y^{-1}$ for some (possibly empty) subsequences $Z$ and $Y$ which do not contain $m$.
Undo $m$ and contract to word $W = ZY$. Output $m$.

STEP 3: If $W$ is empty the reduced evolution is the reverse of the output, else repeat
STEP 2.

For a viable BFB fold word $W$, the evolutionary order of BFBs is precisely the order
that their corresponding fold number first appears in the word. □
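
The reversal procedure can be sketched in Python as follows. This is our own reading of the three steps, not a reference implementation: fold words are lists of integers, the inverse of a subword is its reversal, and we stop once only the end label 0 remains (an assumption on our part, since the end label is not itself a fold). The name `reverse_fold_word` is hypothetical.

```python
def reverse_fold_word(word):
    """Return the reduced BFB order for a fold word, or None if the reversal fails."""
    w = list(word)
    # STEP 1: the word must be an odd-length palindrome X n X^{-1} with a unique centre.
    if len(w) < 3 or len(w) % 2 == 0 or w != w[::-1]:
        return None
    mid = len(w) // 2
    if w.count(w[mid]) != 1:
        return None
    output = [w[mid]]
    w = w[:mid]  # keep X

    # STEP 2 and STEP 3: repeatedly undo the rightmost uniquely occurring symbol.
    while len(w) > 1:
        idx = next((i for i in range(len(w) - 1, -1, -1) if w.count(w[i]) == 1), None)
        if idx is None:
            return None
        tail = w[idx + 1:]                 # candidate Y^{-1}
        k = len(tail)
        # The k symbols before m must equal Y, the reversal of the tail.
        if k > idx or w[idx - k:idx] != tail[::-1]:
            return None
        output.append(w[idx])
        w = w[:idx]                        # contract to Z Y

    # Assumption: a viable word leaves only the end label 0.
    if w != [0]:
        return None
    return output[::-1]


word = [0, 1, 2, 4, 2, 1, 5, 1, 2, 4, 2, 1, 0]
print(reverse_fold_word(word))
```

For the worked example the recovered order coincides with the order in which fold labels first appear in the word, as the theorem states.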

Thus we find that if we know the genomic sequence arising from BFBs, we can
reverse the process and identify the unique minimal sequence of BFBs that led to
that sequence. Unfortunately, experimental data do not always contain such detailed
information. The copy number profile, for example, is a more typical experimental
observable, indicating the number of times different regions are present, but not the
order in which they are present in the chromosomal structure. Furthermore, we have not
considered the random nature of the process and in particular the different orders the
fold positions can take. To help deal with these issues we next introduce a representation
which captures the events that take place, rather than the sequence generated.



3 BFBs as a Discrete Process 

Here we introduce a discrete representation of the BFB process. For every $n$-fold word
we obtain a sequence of $n$ integers which proves more analytically tractable. This allows
us to count the number of qualitatively distinct BFB evolutions for a set number of BFB
cycles.

3.1 BFB sequences 

Consider the evolution portrayed in Figure 2. After each BFB we have a folded structure
with a finite set of DNA segments going forward and backward between fold positions
relative to the reference. We shall construct a BFB sequence $r_n$ to represent these
structures as follows. In order to specify a BFB event we have to indicate where the
next fold is positioned on the current structure. The symmetry of the process (Lemma
2.1) means we can specify that the duplication will always occur from one end, so we
choose the top end of each structure as presented in Figure 2. We then have to indicate
which segment the next BFB fold will occur upon. For reasons described below, the
segment immediately after the midpoint is labeled 1 and the labels of segments before
or after are obtained by counting backward or forwards along the structure, respectively.
The value $r_n$ is then the label for the segment containing the $n$th BFB fold.

Thus for the example of Figure 2 we start with one segment. The first BFB occurs
on this segment so we trivially write $r_1 = 1$. This produces two segments, and so two
choices for the location of the next BFB fold. In our example this occurs on the edge
below the midpoint, so we have $r_2 = 1$, producing four segments. The next BFB occurs
on the last segment, two segments after the midpoint, so $r_3 = 2$. The next BFB loses the
3rd BFB fold, occurring two segments before the midpoint, so counting back from 1, we
have $r_4 = -1$. The final BFB gives us $r_5 = 3$ so we have BFB sequence r = [1, 1, 2, -1, 3],
as indicated in the fifth column of Figure 2.

We noted previously that because the 3rd BFB is deleted by the 4th, the end product
can be obtained by simply implementing the undeleted BFBs. Note that the 4th BFB
fold can be positioned on the third segment after the midpoint of the structure arising
from the 2nd BFB event. We can thus equally represent the final structure with reduced
BFB sequence r = [1, 1, 1, 3]. Note that this can be derived from the full sequence
r = [1, 1, 2, -1, 3] by absorbing the negative value -1 into the preceding value 2, giving
new value 1 in its place.
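
The absorption step can be sketched with a simple stack. The function name `reduce_sequence` is ours, and we treat a zero term like a negative one so that the result is strictly positive; that choice is an assumption on our part rather than something stated in the text.

```python
def reduce_sequence(r):
    """Collapse a full BFB sequence into its reduced (strictly positive) form."""
    reduced = []
    for term in r:
        reduced.append(term)
        # Absorb a non-positive term into its predecessor; repeat while the
        # combined value is still non-positive (several BFBs deleted at once).
        while len(reduced) > 1 and reduced[-1] <= 0:
            last = reduced.pop()
            reduced[-1] += last
    return reduced


print(reduce_sequence([1, 1, 2, -1, 3]))
```

Each pass through the inner loop corresponds to one earlier BFB being deleted by the current event.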

Consider the cumulative sum of the full sequence, s = [1, 2, 4, 3, 6]. We have two
observations. Firstly, note that $s_n$ indicates the number of segments into the structure
that we first encounter a copy of folds arising from the $n$th BFB. For example, the fold
from the 5th BFB is 6 segments into the final structure. Secondly, values $p(-1)^{s_n}$,
where $p = -1$ is the parity, provide a direction signature, d = [1, -1, -1, 1, -1]. Each
number $d_n$ gives the direction of all copies of the $n$th BFB fold, relative to the reference.
For example, all copies of the fold from the 2nd BFB, at position $x_2$, point to the left
($d_2 = -1$). Note that all these signs would be flipped if the original BFB had reversed
parity, facing the opposite way.
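
Both quantities follow directly from the BFB sequence. A short sketch, with `direction_signature` as a hypothetical helper name:

```python
from itertools import accumulate


def direction_signature(r, parity):
    """Return the cumulative sums s_n and directions d_n = p * (-1)**s_n."""
    s = list(accumulate(r))
    d = [parity * (-1) ** value for value in s]
    return s, d


s, d = direction_signature([1, 1, 2, -1, 3], parity=-1)
print("s =", s)
print("d =", d)
```

Calling the function with `parity=1` flips every sign in `d`, in line with the remark on reversed parity.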

Representing the process this way thus enables us to capture some nice properties. 
We summarize this as follows. 

Theorem 3.1. Any $m$-fold word representing a BFB process has an equivalent
representation with a sequence of $m$ integers $\{r_n\}$ that satisfies $-s_n < r_{n+1} \le s_n$, where
$s_n = \sum_{k=1}^{n} r_k > 0$. Values $s_n$ indicate the number of segments into the structure that
we first encounter a copy of folds arising from the $n$th BFB cycle. Each term $r_n$ in the
sequence represents a BFB event. Negative terms indicate deletion of previous BFBs
which shorten the structure. The reduced representation contains strictly positive values
and is obtained by combining $r_{n-1}$ and $r_n$ into the single term $r_{n-1} + r_n$ whenever $r_n < 0$.
This is repeated until only positive terms remain. The number of reductions using $r_n$
equals the number of BFB events deleted by the $n$th event. The direction of all copies
of folds from the $n$th BFB is given by $d_n = p(-1)^{s_n}$, where $p$ is the parity of the initial
BFB. □
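
The constraint in Theorem 3.1 makes it straightforward to count BFB sequences of a given length. The sketch below is our own dynamic programme, not the authors' method. It assumes $r_1 = 1$ and an inclusive upper bound $r_{n+1} \le s_n$, which is what the worked example requires ($r_3 = s_2 = 2$), and it restricts to $r_{n+1} \ge 1$ for reduced sequences. The function name `count_sequences` is hypothetical.

```python
def count_sequences(n, reduced=False):
    """Count BFB sequences of length n satisfying -s_k < r_{k+1} <= s_k.

    With reduced=True only strictly positive terms are allowed.
    """
    # counts[s] = number of sequences of the current length with cumulative sum s
    counts = {1: 1}  # r_1 = 1
    for _ in range(n - 1):
        nxt = {}
        for s, c in counts.items():
            lowest = 1 if reduced else -s + 1
            for r in range(lowest, s + 1):
                nxt[s + r] = nxt.get(s + r, 0) + c
        counts = nxt
    return sum(counts.values())


print([count_sequences(n) for n in range(1, 7)])
print([count_sequences(n, reduced=True) for n in range(1, 7)])
```

Under these assumptions the counts for one to six BFBs agree with the Full Sequences and Reduced Sequences rows of Table 1.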

We now have a way of representing BFBs. We will see that each representation
can account for many different BFB structures. The two structures given in Figure
3Ai,iv are both represented by sequence [1, 1, 2, 2, 1], for example. We build this example
sequentially. We start with a single segment, trivially represented by order word $W_0 = 06$,
which undergoes a BFB at position $x_1$, where $x_0 < x_1 < x_6$. This loses end $x_6$, and,
reporting folds as we read through the structure, we associate the result with word $W_1 = 010$. The
next BFB fold occurs on the segment after the midpoint, which we represent as word
$W_2 = 01210$. The fold occurs at some position $x_2$ where $x_0 < x_2 < x_1$. The next fold
occurs two segments after the midpoint, where $r_3 = 2$ and $s_3 = 4$. This is represented
by word $W_3 = W_2(s_3)\,3\,W_2(s_3)^{-1} = 012131210$ ($W(m)$ represents the first $m$ letters of
word $W$, and the negative power reverses the order) and we find that $x_0 < x_3 < x_1$, or
$x_{W_{2,s_3+1}} < x_3 < x_{W_{2,s_3}}$, where $W_{k,n}$ is the $n$th letter of word $W_k$. We thus find that there
are several order relationships on the reference positions of the BFB; we have a partially
ordered set (poset).

The general situation is described in the following result. 

Lemma 3.1. Define $W_n$ to be the word built with recurrence relation
$W_n = W_{n-1}(s_n)\,n\,W_{n-1}(s_n)^{-1}$, initialised with $W_1 = 010$. If $x_n$ is the reference position
of the $n$th BFB, and $d_n$ is the direction of Theorem 3.1, then we find that the following
partial orderings apply to the positions of the BFB folds:

$d_n = -1 \Rightarrow x_{W_{n-1,s_n+1}} < x_n < x_{W_{n-1,s_n}}$

$d_n = 1 \Rightarrow x_{W_{n-1,s_n}} < x_n < x_{W_{n-1,s_n+1}}$

For example [1, 1, 2, 2, 1] we thus obtain restrictions $x_0 < x_2 < x_1$, $x_0 < x_3 < x_1$,
$x_2 < x_4 < x_1$ and $x_4 < x_5 < x_1$. There are several different orders that satisfy these criteria,
Figure 3Ai,iv being two such examples, where i has order $x_0 < x_3 < x_2 < x_4 < x_5 < x_1$
and iv has order $x_0 < x_2 < x_4 < x_5 < x_3 < x_1$. Note that this reordering of the fold
positions induces a permutation in the fold numbers fn = [2, 2, 2, 1] → [2, 2, 1, 2], which
in turn alters the copy number profiles (Figure 3Aii,iii). The copy number profiles for
a given BFB sequence will thus be distinct in general unless the permutation permutes
identical fold numbers.

Figure 3: Poset of folding structures for a given BFB sequence. In Ai,iv we see two
possible arrangements resulting from reduced representation [1, 1, 2, 2, 1]. Aii,iii give the
copy number profiles of each. In B we see the poset graph construction representing the
possible orders of positions. Nodes represent folds, edges represent inequalities between
fold positions in the reference. Solid and dashed edges indicate major and minor edges,
respectively. Black and orange edges indicate plain and flipped edges. Trees in C indicate
how Lemma 3.3 is used to count the number of possible orders in the poset.
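
The restrictions of Lemma 3.1 can be generated mechanically from a BFB sequence, and the compatible orders enumerated by brute force for small examples. The sketch below is ours and makes two assumptions: the sequence starts with $r_1 = 1$ so that $W_1 = 010$, and the parity is $-1$ so that the retained end $x_0$ lies to the left of every fold. The names `fold_constraints` and `compatible_orders` are hypothetical.

```python
from itertools import accumulate, permutations


def fold_constraints(r, parity=-1):
    """Return triples (lo, n, hi) meaning x_lo < x_n < x_hi, following Lemma 3.1."""
    s = list(accumulate(r))
    word = [0, 1, 0]  # W_1 = 010, assumes r[0] == 1
    constraints = []
    for n in range(2, len(r) + 1):
        sn = s[n - 1]
        # Letters s_n and s_n + 1 of W_{n-1} (1-based in the text, 0-based here).
        first, second = word[sn - 1], word[sn]
        d = parity * (-1) ** sn
        constraints.append((second, n, first) if d == -1 else (first, n, second))
        prefix = word[:sn]
        word = prefix + [n] + prefix[::-1]
    return constraints


def compatible_orders(r):
    """Enumerate left-to-right orders of folds 1..n for a parity -1 sequence."""
    constraints = fold_constraints(r, parity=-1)
    orders = []
    for perm in permutations(range(1, len(r) + 1)):
        position = {label: i + 1 for i, label in enumerate(perm)}
        position[0] = 0  # assumption: retained end x_0 is leftmost for parity -1
        if all(position[lo] < position[n] < position[hi] for lo, n, hi in constraints):
            orders.append(perm)
    return orders


print(fold_constraints([1, 1, 2, 2, 1]))
for order in compatible_orders([1, 1, 2, 2, 1]):
    print(order)
```

For [1, 1, 2, 2, 1] the enumeration yields four orders, including the two arrangements described for Figure 3Ai and 3Aiv. The brute-force search is factorial in the number of folds and is only intended as a check on small cases.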

It is natural to attempt to count and construct the different orders we get for a
single BFB sequence. We do this with the aid of a 2-d tree construct; a special kind of
directed graph that generalizes the notion of a tree, as exemplified in Figure 3B. When
constructing a standard rooted tree, we can build from the root, recursively extending
the tree with a single node and edge from a node that is already present. A 2-d tree
differs in this respect in that once we have one edge and two nodes, each new node has
two parent nodes already present. Two edges are then constructed from these two nodes
to the new node [2].

We construct a 2d-tree as follows. Each new node represents a BFB cycle, with label
$n$ representing the $n$th fold. Edges represent the ordering. When the $n$th fold is formed,
it is positioned on a segment between two pre-existing folds from positions $x_a$ to $x_b$,
where we have $a < b$, without loss of generality. We then construct two directed edges
from nodes $a$ and $b$ to $n$. For example, the second fold in Figure 3Ai has position $x_2$
with $x_0 < x_2 < x_1$; we thus construct two edges from nodes numbered 0 and 1 to a node
labeled 2, as given in Figure 3Bii.

We next introduce some classes for these edges. 

Definition 3.1. Each pair of edges introduced during the 2d-tree construction consists 
of a major and minor edge. The major edge (represented as solid edges in figures), 
extends from the node with greater value b, and the minor edge (dashed), extends from 
the other node labeled a. 

Definition 3.2. Each pair of edges and daughter node are either plain or flipped. If
the word $W_{n-1}$ prior to the formation of fold $n$ changes from ..ab.. to ..an.. ($b > a$), the
two edges and daughter node are plain (black); if it changes from ..ba.. to ..bn.., they
are flipped (orange).

For example, in Figure 3Biv we see node numbered 4 extending from nodes 1 and
2. The major edge (solid) then extends from the larger source node numbered 2.
This node represents the introduction of the 4th fold where word 012131210 becomes
0121314131210. Because the two source nodes are increasing, 12, in the word, both
edges and daughter node are termed plain (black). Conversely, the 5th fold arises when
0121314131210 becomes 012131454131210 so the edges and node are flipped (orange).
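
The edge classes of Definitions 3.1 and 3.2 can be read off the fold words as they are built. The following sketch is our own illustration, assuming $r_1 = 1$ and starting from $W_1 = 010$; the name `two_d_tree_edges` and the dictionary layout are hypothetical.

```python
from itertools import accumulate


def two_d_tree_edges(r):
    """For each fold n >= 2, return its major parent, minor parent and edge kind."""
    s = list(accumulate(r))
    word = [0, 1, 0]  # W_1 = 010, assumes r[0] == 1
    edges = {}
    for n in range(2, len(r) + 1):
        sn = s[n - 1]
        # The new fold lands on the segment between these two adjacent labels.
        first, second = word[sn - 1], word[sn]
        edges[n] = {
            "major": max(first, second),   # Definition 3.1: larger label
            "minor": min(first, second),
            # Definition 3.2: ..ab.. (increasing) is plain, ..ba.. is flipped.
            "kind": "plain" if first < second else "flipped",
        }
        prefix = word[:sn]
        word = prefix + [n] + prefix[::-1]
    return edges


for node, info in two_d_tree_edges([1, 1, 2, 2, 1]).items():
    print(node, info)
```

For the sequence [1, 1, 2, 2, 1] this marks node 4 as plain with parents 2 and 1, and node 5 as flipped with parents 4 and 1, matching the description of the figure.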

The following observation is important when we later consider the number of possible 
evolutions from a fixed number of BFB cycles. 

Lemma 3.2. Consider a node with value $n$ ($n > b > a$) constructed such that the major
and minor edges are attached to nodes with values $b$ and $a$, respectively. Then:

I. If node $n$ is plain (black), any new node with major edge connected to $n$ has a
minor edge connected to $n$'s minor parent $a$.

II. If node $n$ is flipped (orange), any new node with major edge connected to $n$ has a
minor edge connected to $n$'s major parent $b$. □

Thus consider Figure 3Biv, for example. Node 4 extends from flipped node 2 and
so has major edge connected to 2 and minor edge connected to 2's major node 1. Node
5 extends from plain node 4 and so has major edge connected to 4 and minor edge
connected to 4's minor node, 1.

This construction is termed the 2-d Poset Tree, P. Now, the parental nodes labeled
$a$ and $b$ must correspond to a segment in the folded structure, so their positions are
ordered relative to each other. The introduction of the $n$th node is ordered relative to
both. Because the node labeled $b$ is already ordered relative to $a$, we only need consider
the order of $n$ relative to $b$. That is, the major edges describe the ordering information
for the fold positions. By ignoring the minor edges the 2d-tree construction then becomes
a standard tree construction, such as Figure 3Bv. This is termed the Order tree, T.

We now find that any pair of nodes on the same tree branch correspond to folds
with fixed relative positions, whereas nodes on distinct branches correspond to folds
that are not ordered relative to each other. If we have two branches of $k_1$ and $k_2$ nodes
descending from the same parental node, we find that there are $^{k_1+k_2}C_{k_1,k_2}$ ways of
intercalating their positions. If there are already $o_1$ and $o_2$ different possible orders for
each branch we then find $o_1 o_2\, {}^{k_1+k_2}C_{k_1,k_2}$ possible orderings. We thus find that if we place
$^{k}C_{k_1,k_2,\ldots,k_B} = \frac{k!}{k_1!\,k_2! \cdots k_B!}$ at each node, where $k$ is the total number of daughter nodes
and $k_b$ denotes the number of daughter nodes down branch $b \in \{1, 2, \ldots, B\}$, the number
of orders is obtained by simply multiplying these combinatorial terms together, such as in
Figure 3(ii). Combinatorial terms at each end of an edge will then largely cancel (Figure
3(iii)) and we are left with the numerator $n!$ at the root node, where there are $n + 1$
nodes in total, and denominators $m_b$ which count the number of daughter nodes, plus
one, for each node $b$. We then see in Figure 3(iii, iv) that BFB sequence [1,1,2,2,1],
corresponding to word 012131454131210, has 4 possible structures. Summarizing, we
have the following.

BFBs                            1  2  3   4     5      6  ...                  10
Reduced Sequences               1  1  2   7    41    397  ...         627,340,987
Reduced Copy Number Profiles    1  1  3  19   247   6445  ...
Reduced Evolutions              1  1  3  21   315   9765  ...  10,180,699,028,325
Full Sequences                  1  2  6  26   166   1626  ...       2,290,267,226
Full Copy Number Profiles       1  2  5  24   271   6716  ...
Full Evolutions                 1  2  8  64  1024  32768  ...  35,184,372,088,832

Table 1: Counts of distinct representative BFB sequences, evolutions and copy number
profiles.

Lemma 3.3. Let T denote the order tree of a 2-d poset tree P deriving from a fold word
or BFB sequence on n BFB cycles. Tree T then has n + 1 nodes. If each node $b$ below
the root has a label $m_b$, counting the number of daughter nodes plus one, then the number
of orders is given by $O(T) = \frac{n!}{\prod_b m_b}$. □
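
The order count of Lemma 3.3 is straightforward to evaluate. The Python sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes the order tree is given as a map from each non-root node to its major parent, and it reads "daughter nodes" as all nodes below $b$, so that $m_b$ is the size of the subtree rooted at $b$. The function name `count_orders` is hypothetical. The example tree is read off the words 010, 01210, 012131210, 0121314131210 and 012131454131210 (major parents 0, 1, 1, 2 and 4 for nodes 1 to 5).

```python
from math import factorial


def count_orders(major_parent):
    """Number of orders O(T) = n! / prod_b m_b for an order tree.

    `major_parent` maps each non-root node to its major parent; the root is
    the only node that never appears as a key.
    """
    nodes = set(major_parent) | set(major_parent.values())
    # m[b]: size of the subtree rooted at b (nodes below b, plus one).
    m = {b: 1 for b in nodes}
    for b in major_parent:
        a = major_parent[b]
        while True:          # add b to the subtree of every ancestor
            m[a] += 1
            if a not in major_parent:
                break        # reached the root
            a = major_parent[a]
    n = len(nodes) - 1
    denom = 1
    for b in major_parent:   # every node below the root
        denom *= m[b]
    return factorial(n) // denom


# Order tree for word 012131454131210, BFB sequence [1,1,2,2,1].
tree = {1: 0, 2: 1, 3: 1, 4: 2, 5: 4}
orders = count_orders(tree)
```

For this tree the subtree sizes are 5, 3, 1, 2 and 1, so the count is 5!/30 = 4, in agreement with the four structures quoted in the text.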

3.2 The Size of BFB Space 

We now consider the question of how many different BFB sequences are possible, both for 
the case of reduced and non-reduced sequences. Although closed forms for these counts 
would seem intractable, we can derive counts recursively, where we have the following 
result. 

Lemma 3.4. Let $v_1 = w_1 = (1, 0, 0, \ldots)$ denote the infinite vector with single unit entry.
We construct general vectors $v_n$ with components $v_{n,m}$ through the recursive relation
$v_{n+1,m} = \sum_{k=\lfloor (m+1)/2 \rfloor}^{m-1} v_{n,k}$. Then the number of reduced BFB sequences of length n is
$\sum_m v_{n,m}$. Applying the recursion $w_{n+1,m} = \sum_{k=\lfloor (m+1)/2 \rfloor}^{\infty} w_{n,k}$ yields the number of full
BFB sequences of length n as $\sum_m w_{n,m}$. □
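
The recursion can be evaluated directly. The following Python sketch is illustrative rather than the authors' implementation. It assumes that $v_{n,m}$ and $w_{n,m}$ count sequences of length n whose terms sum to m, and that from a current sum k the next term moves the sum to any of k+1, ..., 2k (reduced) or 1, ..., 2k (full). This is the "push" form of the sums in Lemma 3.4, and the function name is hypothetical.

```python
def bfb_sequence_counts(n_max):
    """Return (reduced, full): counts of BFB sequences of length 1..n_max."""
    v = {1: 1}  # v[m]: reduced sequences of the current length with sum m
    w = {1: 1}  # w[m]: full sequences of the current length with sum m
    reduced, full = [1], [1]
    for _ in range(n_max - 1):
        v_next, w_next = {}, {}
        for k, count in v.items():
            for m in range(k + 1, 2 * k + 1):   # new sum after a positive term
                v_next[m] = v_next.get(m, 0) + count
        for k, count in w.items():
            for m in range(1, 2 * k + 1):       # new sum after any allowed term
                w_next[m] = w_next.get(m, 0) + count
        v, w = v_next, w_next
        reduced.append(sum(v.values()))
        full.append(sum(w.values()))
    return reduced, full


reduced, full = bfb_sequence_counts(6)
```

With `n_max = 6` the two lists are 1, 1, 2, 7, 41, 397 and 1, 2, 6, 26, 166, 1626, which are the sequence counts reported in Table 1 for one to six BFBs.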
The resulting counts can be seen in Table 1, where we see the number of sequences
grow rapidly with the number of BFBs. We now turn to the enumeration of the number
of distinct evolutions for each BFB sequence, where we have the following result.

Theorem 3.2. If n BFBs take place then the total number of distinct evolutions is
given by $2^{n(n-1)/2}$. The total number of evolutions that retain copies of all folds is given
by $\prod_{r=1}^{n-1}(2^r - 1)$. The proportion of evolutions that do not lose information then tends
to the limit $\prod_{r=1}^{\infty}(1 - 2^{-r}) \approx 0.288$.
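
As a quick numerical illustration of Theorem 3.2, the two closed forms can be evaluated with exact integer arithmetic. This sketch assumes the upper product limit n - 1, which is the reading that reproduces the Reduced Evolutions row of Table 1. The function names are hypothetical.

```python
def full_evolutions(n):
    """Total number of distinct evolutions from n BFBs: 2^(n(n-1)/2)."""
    return 2 ** (n * (n - 1) // 2)


def reduced_evolutions(n):
    """Evolutions retaining copies of all folds: prod_{r=1}^{n-1} (2^r - 1)."""
    total = 1
    for r in range(1, n):
        total *= 2 ** r - 1
    return total


# Proportion of evolutions that lose no folds, for a moderately large n.
proportion = reduced_evolutions(30) / full_evolutions(30)
```

For n = 10 these give 35,184,372,088,832 and 10,180,699,028,325 respectively, and the proportion at n = 30 is already within rounding of the limiting value 0.288.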

We can also see these counts in Table 1, where we see the number of evolutions rising
super-exponentially as a function of BFB count, becoming incomputable beyond about
eight BFBs.

The proof of this theorem relies on an appropriate induction. In Figure 4A we observe
the first few evolutions. The initial fold gives rise to one structure (ignoring parity). The
structure has two segments that a subsequent fold can form along, resulting in the two new
structures of Figure 4A.ii. These two structures have two and six segments between them
that a new fold can form along, giving eight new structures in Figure 4A.iii. The number
of segments in any structure then determines the number of possible structures we can
get with one more fold.



Unfortunately, this does not help us explain Theorem 3.2. Curiously, to prove this
we need to reintroduce the first fold. In Figure 4B we see the four possible structures
that correspond to the 5-fold word W = 023242565242320, along with the sequence of
word operations that generate W. For each word $W = A n A^{-1}$ in Figure 4B,C we write
An.. for brevity. In Figure 4C we see several possible ways of introducing an initial fold
that preserves the order of the other folds in the word. For example, from the word
010 we can introduce fold 2 before or after 1 to give us 020 or 01210. The word 020
then follows the same evolution as Figure 4B, whereas 01210 again provides two choices:
remove the second 1 to give 0123210, or introduce fold 3 after the second copy of fold
1 to give 012131210. When we follow this decision process through all five folds we get
nine words. We then calculate the number of orders for each word with Lemma 3.3 and
find that we have $2^5$ times the original number of orders. To explain this we need to
introduce a class of operations on the two-dimensional poset trees introduced above.

We constructed an order tree from the 2d-tree by removing all minor edges. We 
require the capacity to modify the shape of an order tree with the following operation. 



ES: Edge Switching 

Remove a major edge and replace the corresponding minor edge. 

This move can be seen in Figure 5 and effectively moves a branch nearer to the root,
and results in a tree structure. This move has no effect on the other edges or the nodes
they are attached to. ES operations thus commute; we can perform the moves in any
order and get the same structure.

This is a specific form of the Subtree Prune and Regraft (SPR) operation that has
seen application to many other problems in evolution [16].

Figure 4: Evolutions of BFB cycles. In A we see one structure with one fold (i), two
structures with two folds (ii) and $2^{3(3-1)/2}$ structures arising from three folds (iii). In B
we have the 5-fold word 023242565242320 along with 4 possible structures. In C we see
the introduction of a first fold gives rise to $4 \cdot 2^5$ possible structures. The order counts
shown in brackets in the figure are 4, 4, 5, 15, 10, 20, 20, 20 and 30.



We are interested in the following sets of ES operations. Let S(T) denote the set of
subtrees of a given order tree T that include the root node. For a subtree $s \in S(T)$,
$T_s$ is the tree obtained from T as follows.

SS: Subtree Switching 

I Perform an ES operation on any flipped (orange) edge contained in the subtree. 

II Perform an ES operation on any plain (black) edges adjacent (so not contained) 
to the subtree. 

III Leave the remaining edges alone. 

Examples can be seen in Figure 6. In row (*) we see the subtree with the edges 02,
23 and 24. Edges 23 and 24 are flipped, so we replace them with their corresponding
minors (3rd column) (SS move I). Edge 02 is untouched. Plain edge 35 is adjacent to the
subtree, so it is also switched (SS move II).

Figure 5: Edge Switch operation: Remove the major edge and replace the corresponding
minor edge.

We find there is a unique correspondence between the possible introductions of a 
first fold 1 and the SS operations. 

Lemma 3.5. Let T be the order tree with n + 1 nodes associated with an n-fold word
according to Lemma 3.3. Let s denote the set of major edges that are directed towards
nodes labeled m where we have replaced the recursion $W_m = W_{m-1}(s_m)\, m\, W_{m-1}(s_m)^{-1}$
with $W_{m-1}(s_m)\, 1m1\, W_{m-1}(s_m)^{-1}$ in the evolution of the word. Then s is a subtree of T
and the corresponding order tree is obtained by implementing SS operations on s. □

We see examples of this in Figure 6. The connection between these tree operations
and Theorem 3.2 can now be introduced.



Theorem 3.3. For any order tree T with n + 1 nodes, let S denote the set of subtrees
with the same root. Then $\sum_{s \in S} O(T_s) = O(T)\, 2^n$, where O is the order function of
Lemma 3.3.



This result follows from the following observation, which can be proved inductively.

Lemma 3.6. Let T denote a tree with n + 1 nodes such that there is exactly one node
directly below the root (so n > 1). For any subtree $s \in S(T)$ we let $b_s$ denote the number
of daughter nodes from node b, plus one, after implementing SS on the subtree s. Then

$$\sum_{\{s \in S(T) :\, b_s = r\}} O(T_s) = \begin{cases} {}^{n}C_{r}\, O(T), & r = 1, 2, \ldots, n-1, \\ 2\,O(T), & r = n. \end{cases}$$

Examples of this can be seen in Figure 6, where the counts are broken down according
to the values of node $b_s$ (in blue).

Summing over the possible values of r then gives us $2^n O(T)$, returning the result of
Theorem 3.3. □
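
The summation step can be written out explicitly. The following worked identity is a sketch added for clarity and is not taken from the source. It assumes the counts in Lemma 3.6 are the binomial coefficients $\binom{n}{r}$ for $r = 1, \ldots, n-1$ together with a factor 2 for $r = n$.

```latex
\sum_{s \in S(T)} O(T_s)
  = O(T)\left[\sum_{r=1}^{n-1} \binom{n}{r} + 2\right]
  = O(T)\left[(2^n - 2) + 2\right]
  = 2^n\, O(T).
% Example with n = 5 and O(T) = 4:
% 4 (5 + 10 + 10 + 5 + 2) = 4 \cdot 32 = 128 = 4 \cdot 2^5.
```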

An example of this can be seen in the last column of Figure 6, where a graph
corresponding to 5-fold word 023242565242320 and BFB sequence [1,1,2,2,1] with 4
orders results in $4 \cdot 2^5$ new orders when a first fold is introduced.

Figure 6: Subtree order counts are given for the tree corresponding to word
023242565242320 from Figure 4. The first column lists all nine words arising
from introduction of fold 1. The second column indicates corresponding 2-d tree subtrees
in bold. The third column indicates the order tree after implementing SS operations.
The fourth column counts the orders.



We are now in a position to prove our main result and count evolutions. We have
seen that any n-fold word corresponding to an order tree T has $O(T)$ possible structures.
These are associated with $O(T)\,2^n$ possible (n+1)-fold structures by the introduction of a
new first fold. If we start with the trivial structure and inductively perform these fold
introductions, we find that we have $1 \cdot 2 \cdot 2^2 \cdots 2^{n-1} = 2^{1+2+\cdots+(n-1)}$ possible evolutions.
That is, there are $2^{n(n-1)/2}$ possible evolutions using n folds, as required.

The formula for the reduced sequences is obtained by ignoring the single choice in
the introduction of the first fold that loses the first fold (that is, we do not require the
first line of Figure 4C). □

The total number of distinct copy number profiles was also determined for fixed
BFB counts, as summarised in Table 1. Clearly the number of copy number profiles
is smaller than the number of evolutions and there may be several evolutions for any
given copy number profile. Furthermore, we can construct an infinite number of BFB
sequences with negative values that all reduce to any given reduced sequence. We thus
need other methods to help identify the correct evolution for any given copy number
profile.

So far we have treated BFB cycles as a discrete process, treating the folded structures 
as functions of BFB sequences, a sequence space which we have now explored in some 
detail. However, the BFB process relies on the fold occurring somewhere along the 
length of the structure. We can thus consider the fold positions in a sequence of BFB 
structures as a stochastic process, and investigate the implications of this on the BFB 
structure. 



4 BFBs as a Stochastic Process 

We first consider the stochastic nature of the structure's length under the simplest
assumption that the fold position is uniformly distributed along the structure. We then
consider the likelihood associated with certain BFB representative sequences, showing in
particular that structures for some BFB sequences are more likely to occur than others.

4.1 Length Distributions 

So far we have considered each BFB product as a structure folded with respect to a set
of reference positions. We now imagine unfolding the entire structure at each stage.
Such an example can be seen in Figure 7. Here we start with a product of length $L_0 = L$.
From Lemma 2.1 we can assume that duplication is on the left side of the position of
any BFB fold. The break occurs uniformly along this length, so after duplication, repair
and unfolding, the next length $L_1 \in [0, 2L_0]$. The first fold in this example is beyond
the midpoint of the previous structure, so the resulting length increases, as it does for
the next two BFBs. However, the fourth fold occurs in the first half of the previous
structure, reducing the length and removing the second and third folds before the final
BFB again extends the structure.

We then see that the length $L_n$ is a stochastic Markovian process with conditionally
uniform distribution $L_n \mid L_{n-1} \sim U([0, 2L_{n-1}])$. The general length distribution $P(L_n)$
can then be derived, giving the following result.

Theorem 4.1. If $L_0 = L$ is the initial length of the chord, then the length $L_n$ after the
$n$th BFB has distribution

$$P(L_n) = \begin{cases} \dfrac{1}{2^n L\,(n-1)!} \log^{n-1}\!\left(\dfrac{2^n L}{L_n}\right), & L_n \le 2^n L, \\ 0, & L_n > 2^n L, \end{cases}$$

with mean value $L$ and standard deviation $L\sqrt{(4/3)^n - 1}$. □

Figure 7: The lengths of the unfolded products of a sequence of five BFB cycles
(axis: total BFB product length).
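
The mean and standard deviation quoted in Theorem 4.1 can be checked by simulating the length process. The Python sketch below is illustrative only: it assumes the conditional law $L_k \mid L_{k-1} \sim U(0, 2L_{k-1})$ stated above and the standard deviation $L\sqrt{(4/3)^n - 1}$. The function names, sample size and seed are arbitrary choices, not values from the source.

```python
import random
from math import sqrt


def simulate_lengths(n, L=1.0, trials=100_000, seed=0):
    """Draw `trials` samples of L_n, where L_k | L_{k-1} ~ U(0, 2 L_{k-1})."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        length = L
        for _ in range(n):
            length = rng.uniform(0.0, 2.0 * length)
        samples.append(length)
    return samples


def mean_and_sd(samples):
    """Sample mean and (population) standard deviation."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, sqrt(var)


samples = simulate_lengths(3)
mean, sd = mean_and_sd(samples)
expected_sd = sqrt((4.0 / 3.0) ** 3 - 1.0)  # compare with sd
```

The sample mean should stay close to L for every n, while the sample standard deviation grows with n, which is the increasing variability discussed in the text.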

Thus we find that although the length's average value does not change, it is
increasingly variable. We also see that the shortest distance $L_m/2$ of any copy of the $m$th fold
from the ends is preserved. The first fold encountered is always a distance $L_1/2$ from either
end, for example. All copies can be lost, however, and we have seen in Figure 7 that as
the BFB process continues, BFB events that shorten the structure can delete all copies
of folds from some previous BFB events. We can characterize these properties as follows.

Theorem 4.2. The original distance $L_m/2$ of the $m$th BFB fold from the end of the
structure is the shortest distance of any subsequent copy of that fold to either end. If
$L_n < L_m$ and $n > m$ then all copies of the $m$th BFB fold are permanently excised from
the BFB product. Thus if the $m$th BFB fold is to avoid extinction through a series of N
BFBs then $L_m < \min_{n>m} L_n$. Subsequently, if we have a series $L_1, L_2, \ldots, L_N$ of BFB
lengths, the only BFB folds that survive will be a subset with increasing length, in the
same order that they occurred. □
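
The survival criterion of Theorem 4.2 translates into a single backward pass over the length series. This is a minimal sketch, assuming the lengths are supplied in the order the BFBs occurred. The function name and the example lengths are hypothetical.

```python
def surviving_folds(lengths):
    """Return the 1-based indices m with L_m < min over n > m of L_n."""
    survivors = []
    running_min = float("inf")  # minimum of all later lengths seen so far
    for m in range(len(lengths), 0, -1):
        if lengths[m - 1] < running_min:
            survivors.append(m)
        running_min = min(running_min, lengths[m - 1])
    return survivors[::-1]


# Hypothetical series in which the fourth BFB shortens the structure.
lengths = [1.5, 2.0, 3.0, 1.8, 2.5]
kept = surviving_folds(lengths)
```

In this example the fourth BFB removes the second and third folds, leaving folds 1, 4 and 5 with increasing lengths 1.5, 1.8 and 2.5.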

This raises two issues. Firstly, if we observe a sequence $L_1 < L_2 < \ldots < L_n$ of BFB
lengths in a final structure we would like to know how many folds from other BFB events
have been completely excised from the genome in the process. Secondly, we know that
the smallest length $L_1$ is the earliest remaining BFB. The fold at position $L_1/2$ is thus
the first encountered as we traverse the structure. This also gives the position of the
outermost fold relative to the reference. For example, in Figure 3(i), the first fold, at
position $x_1$, is furthest from the ends of the structure, relative to the reference positions.
The first fold thus measures the size of the amplicon.

A better understanding of the order statistics of the length sequence L n will help our 
understanding of both the scale of deleted BFB folds, and the size of amplicons, where 
we have the following result. 

Theorem 4.3. The probability density $M_{k,N}(x, L)$ that the $k$th BFB of a series $L_1, L_2, \ldots, L_N$
is the minimum with length $x$ is given by
M KN {x,L) = 2feW fc (x,L)(l-£«=i
where L is the original length and,



W k (x,y) = J^J^...J Q 



z\...z h 
ik 



dz h ...dz x = Ei=oo}+i(a?) ^g j {2 k y), 



a is the $(k+1)$-length vector $\prod_{r=1}^{k} B_r$, and $B_r$ is the $(r + 1) \times r$ matrix



( -log(2 r - 1 x) -^log 2 (2 r - 1 x) 



V 



1 








1 

2 












□ 



/ 



We can use this to get the distribution of both the minimum length and its occurrence
in the BFB sequence, as indicated in Corollary 4.1 i-iii below.

The result also enables us to get the distribution of the amplicon size, that is, the
position $L_{amp}$ of the outermost fold relative to the reference, $\frac{1}{2}\min_{k \le N}\{L_k\}$, as summarized
in Corollary 4.1 iv.

We next consider an observed sequence of BFB folds with corresponding lengths
$L_1 < L_2 < L_3 < \ldots < L_n$ and estimate how many BFBs were likely to have been deleted
in this process. Specifically, if we have a sequence of BFBs with lengths $l_{1,1}, \ldots, l_{1,d_1},
L_1, l_{2,1}, l_{2,2}, \ldots, l_{2,d_2}, L_2, \ldots, L_n$ such that $L_{i-1} < L_i < l_{i,1}, \ldots, l_{i,d_i}$, then by Theorem 4.2
the BFB with length $L_i$ deletes the $d_i$ earlier BFB folds with longer lengths $l_{i,1}, \ldots, l_{i,d_i}$ to
leave the events $L_{i-1}, L_i$. We can use Theorem 4.3 to estimate the scale of loss, $d_i$. If we
have a BFB of length $L_{i-1}$, which is followed by the sequence $l_{i,1}, \ldots, l_{i,d_i} > L_{i-1}$, then we
require $l_{i,1}, \ldots, l_{i,d_i} > L_i$, given that we start with length $L_{i-1}$. Writing $l_j$ for $l_{i,j}$ and $d$ for $d_i$,
that is

$$\Pr(l_1, \ldots, l_d > L_i \mid L_{i-1}) = 2^{-d} \int_{L_i}^{2L_{i-1}} \int_{L_i}^{2 l_1} \cdots \int_{L_i}^{2 l_{d-1}} \frac{1}{L_{i-1}\, l_1 \cdots l_{d-1}}\, dl_d \cdots dl_1.$$

This can be calculated in much the same way as Theorem 4.3 (see Appendix). A Bayesian
inversion then allows us to estimate the distribution of $d_i$, given in Corollary 4.1 v. In
summary we have the following.

Corollary 4.1. (i) The probability that the k th of N BFBs is the one with the minimum 
length L min is given by „™ k f. {Lm ;?' L) .. . 

(ii) The probability density function of the minimum BFB length in a sequence of N
BFBs is given by $\sum_{k=1}^{N} M_{k,N}(L_{\min}, L)$.

(iii) The distribution of the number N of BFBs for a given minimum length $L_{\min}$ is
then given by Pr(N = n\L mm ) = J^J^ (L ™"' L) . 

(iv) The amplicon size, $L_{amp}$, given a sequence of N BFB cycles has distribution
$2 \sum_{k=1}^{N} M_{k,N}(2L_{amp}, L)$.

(v) If we observe two BFBs with consecutive lengths Lj_i < Li the number of BFBs 
occurring between them that are deleted by the i th BFB, Di, is, Pr(A = d\L l - 1 < Li) = 



i d > where I d = 1 - J2t=o 2* W k(Li, Lj-i) and W {Li, Lj_i) = 1. 



^d-1 J_ 



□ 



Some of these distributions are plotted in Figure 8, where we see the trend that the
outermost fold of the BFB, that is, the size of the amplicon, decreases as the number of
BFB events increases.

Figure 8: BFB Distributions. A) The distribution of BFB counts for a range of amplicon
sizes (as a proportion of L). B) The distribution of the amplicon size for a range of BFB
counts. The mean positions are located at the circles. C) The mean and median number
of BFBs as a function of amplicon size. D) The expected number of deleted BFB events
as a function of the ratio $L_i : L_{i-1}$. E) The distribution of the minimum length BFB for
ten BFBs and a range of amplicon sizes.

This can be intuited as follows. If a BFB product has current minimum length $L_{\min}$,
then there is a chance the next fold will be smaller than $L_{\min}$, deleting all previous folds,
and reducing the position of the outermost fold. That is, the BFB process will result in
atrophy of the amplicon size.



We also know from Theorem 4.1 that the average length of the structure does not
change. This means that, on average, the same amount of DNA is present in a
diminishing region of the reference genome, resulting in the localized high copy number structures
typical of amplicons, such as Figure 1.

4.2 Fold Structure Likelihoods 

We have seen different BFB structures arising due to different BFB sequences. It is thus
natural to investigate the likelihood of a particular BFB sequence occurring. We extend
the stochastic process approach above to elucidate this problem.

Suppose we are interested in the likelihood of observing a BFB sequence such as r =
[1, 1, 2], the fourth structure in Figure §. We can build the likelihood inductively. The
first fold $x_1$ is uniformly distributed across the original structure of length L, so we have
$\Pr(x_1 \mid r_1) = \frac{1}{L}$. The second fold occurs at position $x_2$ on the second segment ($r_2 = 1$) and
so satisfies the inequality $x_0 < x_2 < x_1$. It is uniformly distributed along this segment,
so $\Pr(x_2 \mid x_1, r_1, r_2) = \frac{1}{x_1 - x_0}$. Now $\Pr(r_3 = 2 \mid x_1, x_2, r_1, r_2)$ is the chance of hitting the
last of the four segments of the structure corresponding to $[r_1, r_2] = [1, 1]$. This segment
has length $x_1 - x_0$ and the total length of the structure is $2(x_1 - x_0) + 2(x_1 - x_2) =
4x_1 - 2x_2 - 2x_0$. If we suppose $x_0 = 0$ and $L = 1$ for simplicity then we get probability
$\frac{x_1}{4x_1 - 2x_2}$. We can then put this information together to get the probability of getting
BFB sequence [1, 1, 2] conditional upon [1, 1] as follows:

$$\begin{aligned}
P([1,1,2] \mid [1,1]) &= \int_{0<x_2<x_1<1} P([1,1,2], x_1, x_2 \mid [1,1])\, dx_1\, dx_2 \\
&= \int_{0<x_2<x_1<1} P([1,1,2] \mid x_1, x_2, [1,1])\, P(x_2 \mid x_1, [1,1])\, P(x_1 \mid [1,1])\, dx_1\, dx_2 \\
&= \int_{0<x_2<x_1<1} P(r_3 = 2 \mid x_1, x_2, r_1, r_2)\, P(x_2 \mid x_1, r_1, r_2)\, P(x_1 \mid r_1)\, dx_1\, dx_2 \\
&= \int_{0<x_2<x_1<1} \frac{x_1}{4x_1 - 2x_2} \cdot \frac{1}{x_1} \cdot \frac{1}{1}\, dx_1\, dx_2 = \frac{1}{2}\log 2.
\end{aligned}$$

This process can be applied in general, which we summarize below.

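Lemma 4.1. The likelihood of seeing reference positions $x_n$ for a given BFB sequence $r_n$
is given by $P(x_1, x_2, \ldots, x_n \mid r_1, r_2, \ldots, r_n) = \prod_{i=1}^{n} \left\{ \frac{1}{x_{i,\max} - x_{i,\min}} \right\}$, where $x_{i,\min} < x_i < x_{i,\max}$
are the inequalities of the poset for $r_n$ given by Lemma 3.1.

The conditional probability of the next BFB sequence element $r_n$ is $\Pr(r_n \mid x_1, \ldots, x_{n-1},
r_1, \ldots, r_{n-1}) = \frac{x_{n,\max} - x_{n,\min}}{L_{n-1}}$, where $L_{n-1}$ is the length after $n - 1$ BFB cycles, a linear
homogeneous function of $x_1, \ldots, x_{n-1}$. We then find

$$\begin{aligned}
\Pr(r_n \mid r_1, \ldots, r_{n-1}) &= \int_A \Pr(r_n, x_1, \ldots, x_{n-1} \mid r_1, \ldots, r_{n-1})\, dx \\
&= \int_A \frac{x_{n,\max} - x_{n,\min}}{L_{n-1}(x_1, \ldots, x_{n-1})} \prod_{i=1}^{n-1} \left\{ \frac{1}{x_{i,\max} - x_{i,\min}} \right\} dx,
\end{aligned}$$

where A is the region defined by inequalities $x_{i,\min} < x_i < x_{i,\max}$. □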

The probabilities for the first few BFB sequences can be seen in Figure 9. These
integrals rapidly become intractable and numerical methods are required. The simplest
method is to randomly generate $x_1, x_2, \ldots$ according to the conditional uniform
distributions and average the simulated probabilities $P(r_n \mid x_1, x_2, \ldots, x_{n-1}, r_1, \ldots, r_{n-1})$.

Figure 9: Probabilities of fold events in BFB space. The black chords indicate the
structure for each BFB sequence (indicated in red). Numbers alongside arrows indicate
the probability of taking that step.
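
The sampling approach described above can be sketched for the worked example of sequence [1, 1, 2] conditional upon [1, 1]. This is a minimal illustration, not the authors' code: it assumes $x_0 = 0$ and $L = 1$, draws $x_1$ uniformly on (0, 1] and $x_2$ uniformly on $(0, x_1)$, and averages the conditional probability $x_1 / (4x_1 - 2x_2)$. The function name, sample size and seed are arbitrary.

```python
import random
from math import log


def estimate_p112(trials=100_000, seed=0):
    """Monte Carlo estimate of P([1,1,2] | [1,1]) with x0 = 0 and L = 1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x1 = 1.0 - rng.random()        # uniform on (0, 1], avoids x1 == 0
        x2 = rng.uniform(0.0, x1)      # second fold lies between x0 and x1
        total += x1 / (4.0 * x1 - 2.0 * x2)
    return total / trials


estimate = estimate_p112()
exact = 0.5 * log(2.0)  # closed form from the integral above
```

The estimate can be compared with the closed form (1/2) log 2, roughly 0.347. Longer sequences follow the same pattern, with one conditional uniform draw per fold.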

Note that the first BFB has a 0.5 probability of having positive or negative parity;
this is not encapsulated by this formula. Note also that these probabilities are not for
reduced BFB sequences. For example, although BFB sequences [1, 1, 1] and [1, 1, 2, -1]
reduce to the same structure, they take different paths through the evolutionary graph of
Figure 9 and have different likelihoods of occurrence. Multiplying the edge probabilities,
we find the probabilities of arising are 0.038 and 0.008, respectively.

5 Applications to Amplicons in Cancer Genomes 

We now put the structure we have described into context with some real data.

5.1 Inference of BFB evolution

For a given amplicon such as Figure 1, identifying the underlying BFB sequence is
desirable because it provides an explanatory evolution of events. This would allow us
to obtain the order that the folds occurred in this process, and hence construct the fold
structure that generates the amplicon.

Rank  BFB Sequence  Copy Number Profile †  Order        log Pr(x|r)  log Pr(z|c)  Log-Likelihood
1     [1,1,2,2,3]   [16,12,14,6,2]         [1,5,4,2,3]  -81.674      -921.1       -1002.774
2     [1,1,2,4,1]   [16,12,14,6,2]         [1,4,2,5,3]  -82.264      -921.1       -1003.364
3     [1,1,2,4,5]   [24,20,22,10,2]        [1,4,5,2,3]  -82.264      -2008.9      -2091.164
4     [1,1,2,2,1]   [12,8,10,6,2]          [1,5,2,4,3]  -81.674      -2056.7      -2138.374
5     [1,1,2,2,5]   [20,12,14,10,2]        [1,5,2,4,3]  -82.038      -2076.4      -2158.438

Table 2: Top five likelihoods for evolution of Figure 1. † Copy numbers only include
regions II - VI for chromosome undergoing BFB process, other chromosomes are ignored.



We would like to use the machinery developed above. There are some extra
difficulties, however. Although the experimental signal (read depth) is a linear function
of copy number, in many cases it can be difficult to ascertain the actual integer copy
number of each segmented region with methods such as [5], especially when the regions
are small. However, the signals across the amplicon can often be ranked, and the likely
number of BFBs identified, meaning we can filter the set of possible evolutions to a more
manageable set.

For example, consider the amplicon of Figure 1. We have six segmented regions I to
VI, separated by five breakpoints. For four of these we used next generation sequencing
to find rearrangement positions [3] and found discordantly mapping reads consistent
with BFB events. Although no aberrantly mapping reads could be found at the junction
between regions I and II, this was likely due to mapping difficulties and the data are
indicative of a structure formed by BFB cycles. The rightmost region, VI, has a higher
signal than the leftmost, I, suggesting a BFB with right parity (p = 1); region I is not
part of the BFB structure. This gives us five segmented regions, and so five folds to
explain. The fold positions are labeled $x_i$, $i = 1, \ldots, 5$. We select the most likely evolution
as follows.

The mean (sequence depth) signals for the six regions were z = [154.7, 519.8, 398.2,
465.2, 305.5, 186.3]. This is based upon m = [5578, 6716, 3969, 2536, 8768, 5366]
measurements (bins containing reads) in each region. We also measure the standard deviations
in each region, $\sigma$ = [31.4, 80.1, 63.1, 67.8, 69.5, 45.1]. Then we assume that for any given
evolution with BFB sequence $r = [r_1, \ldots, r_n]$ and copy number profile $c = [c_1, \ldots, c_n]$, we
have normal distribution $(z_i \mid c_i) \sim N(\alpha + \beta c_i, \sigma_i^2/m_i)$, where $\alpha$ and $\beta$ are parameters that
represent the linear relationship between signal and integer copy number. If we have
good information on this relationship $\alpha$ and $\beta$ can be stated, otherwise we treat them
as unknown parameters. We then construct a likelihood $\Pr(z, x \mid r, c) = \Pr(z \mid c)\Pr(x \mid r)$
with the help of Lemma 4.1. This is maximized over $\alpha$ and $\beta$ and the likelihood recorded.
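
The maximization over the two linear parameters has a closed form when the signal model is Gaussian. The Python sketch below is hypothetical and is not the authors' pipeline: it assumes each mean signal $z_i$ is normal with mean $\alpha + \beta c_i$ and variance $\sigma_i^2 / m_i$ (an assumed reading of the variance term), so that the maximizing $\alpha$ and $\beta$ are the weighted least squares estimates. It is not claimed to reproduce the values in Table 2, which may rest on further constraints not recoverable here. The function name and the synthetic signals in the example are illustrative.

```python
import math


def max_gaussian_loglik(z, sigma, m, c):
    """Maximise sum_i log N(z_i; alpha + beta * c_i, sigma_i**2 / m_i).

    Returns (alpha, beta, loglik). Needs at least two distinct values in `c`.
    """
    w = [mi / si ** 2 for si, mi in zip(sigma, m)]  # inverse variances
    sw = sum(w)
    c_bar = sum(wi * ci for wi, ci in zip(w, c)) / sw
    z_bar = sum(wi * zi for wi, zi in zip(w, z)) / sw
    s_cc = sum(wi * (ci - c_bar) ** 2 for wi, ci in zip(w, c))
    s_cz = sum(wi * (ci - c_bar) * (zi - z_bar) for wi, ci, zi in zip(w, c, z))
    beta = s_cz / s_cc                 # weighted least squares slope
    alpha = z_bar - beta * c_bar       # weighted least squares intercept
    loglik = sum(
        0.5 * math.log(wi / (2.0 * math.pi))
        - 0.5 * wi * (zi - alpha - beta * ci) ** 2
        for wi, ci, zi in zip(w, c, z)
    )
    return alpha, beta, loglik


# Regions II-VI: standard deviations and bin counts quoted in the text.
sigma = [80.1, 63.1, 67.8, 69.5, 45.1]
m = [6716, 3969, 2536, 8768, 5366]
c = [16, 12, 14, 6, 2]                      # one candidate copy number profile
z_synthetic = [150.0 + 20.0 * ci for ci in c]  # synthetic, exactly linear signals
alpha, beta, loglik = max_gaussian_loglik(z_synthetic, sigma, m, c)
```

On exactly linear synthetic signals the fit recovers the intercept and slope used to generate them. With observed signals, the returned log-likelihood would be added to log Pr(x|r) to rank candidate evolutions.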

For Figure 1 we found 24 evolutions produced copy number profiles with the same
rank as $z_i$, arising from 9 distinct BFB sequences. The top five are listed in Table 2.
The maximum likelihood solution suggests the actual copy numbers are [16, 12, 14, 6, 2],
corresponding to the BFB sequence [1, 1, 2, 2, 3]. The resultant genome is given in Figure
1.

We note that even with all this information we cannot necessarily guarantee a
strongly identified best fit. For example, we see the top two ranked evolutions
produce the same copy number profile and it is only the likelihood Pr(x|r) that weakly
distinguishes these two cases. There is the possibility of utilizing additional information
in the form of single nucleotide mutations within the amplicon. This is an approach that
has been applied in a more general context previously [4], which can also be used to time
rearrangement events relative to the nucleotide mutation process. However, amplicons
are often quite narrow and may not contain sufficient mutations to give much statistical
power to further differentiate evolutions, so this was not explored further.

The portrait of BFB cycles sketched in Figure 1A would appear to continue
indefinitely. However, this process will stop if we only have one centromere after DNA repair.
This may be because a somatic telomere forms on one of the broken ends, but can also be
because one of the exposed ends is attached to a different exposed end, such as another
chromosome. This pattern can be observed in the data where, in Figure 10 for example,
we see a region with three breakpoints in chromosome 12, two of which are associated
with BFB folds, and the middle one associated with a translocation to chromosome 11.
This was likely to be the last step, terminating the breakage-fusion-bridge process.




Figure 10: A BFB cluster. In A we see the amplicon; the two outer breakpoints are BFBs,
the middle one a translocation to chromosome 11. B indicates the genomic structure
(chromosomes 12 and 11).



5.2 Clonality of BFB Process Under Selection 

Our models have so far assumed that the BFB mutation process is randomly sampled
and under no forces of selection. This is unlikely to be the case in general, and any
cells that have a growth advantage are likely to be selected and emerge in cancer samples.
We have seen in Figure 1A that the BFB sequence arises due to spindles attached to
the dicentromeric chromosome, dissecting the chromosome during cell division. This
preserves the total amount of doubled DNA in both daughter cells. For a given break,
we find the DNA is duplicated on one side of the break in one daughter cell, and on
the other side in the other daughter cell. In Figure 1A, for example, one daughter cell
contains two yellow and four red genes, the other daughter cell two yellow genes, but
the total number of red and yellow genes across both progeny is conserved at four. We
then find that if the parent cell has BFB sequence $[r_1, r_2, \ldots, r_{n-1}]$ with $s_{n-1} = \sum_{i=1}^{n-1} r_i$,
and one daughter cell has sequence $[r_1, r_2, \ldots, r_{n-1}, r]$, then the other daughter cell has
sequence $[r_1, r_2, \ldots, r_{n-1}, 1 - r]$. Given that we must have $-s_{n-1} < r \le s_{n-1}$, one of $r$
and $1 - r$ must be non-positive.
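
The pairing of daughter cells can be made concrete with a short enumeration. This is an illustrative sketch, assuming the admissible range for the new term is $-s_{n-1} < r \le s_{n-1}$ (so that there are $2 s_{n-1}$ choices) and that the sister cell receives $1 - r$. The function name is hypothetical.

```python
def sister_terms(s_prev):
    """All pairs (r, 1 - r) of final terms for the two daughter cells.

    `s_prev` is the sum of the parent's BFB sequence; r ranges over
    -s_prev < r <= s_prev.
    """
    return [(r, 1 - r) for r in range(-s_prev + 1, s_prev + 1)]


pairs = sister_terms(2)   # parent sequence [1, 1], so s_prev = 2
```

For a parent with sum 2 the pairs are (-1, 2), (0, 1), (1, 0) and (2, -1). In every pair exactly one daughter receives a positive term, which is why a single lineage can retain all of its folds through successive divisions.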

If this process proceeds over n cycles, and so n cell divisions, producing $2^n$ cells, we
are left to conclude that one of the cell lineages will consist entirely of positive terms
and hence lose no BFB folds. This lineage is always gaining DNA by Theorem 3.1 and
will be a good candidate to contain multiple copies of genes that may be advantageous
to cancer. This cell may then emerge as a dominant clone, resulting in subsequent rapid
amplicon development.

We can also argue this from a different perspective. Note that the distribution of the
amplicon size in Figure 8B has a mean value that moves toward the origin as the number
of BFBs increases. If a gene is, for the sake of argument, half way between centromere
and telomere, and five BFB cycles take place, one can integrate this distribution up to
that point to conclude that there is approximately a 95% chance that the outermost fold
is before the target gene and therefore only one copy of the gene is present (on the other
allele) in the cell. Initially each cell has two copies of a gene. After 5 cell divisions there
will be 32 cells and 64 copies of the gene target distributed amongst them. This implies
that many copies of those genes are likely to be contained in one or two of those cells.
Thus we find that it only takes relatively few BFB cycles to generate a cell containing
multiple copies of a gene. If the gene is an oncogene, this cell then becomes a good
target for selection and subsequent clonal expansion, producing the types of amplicons
observed in cancer.

Selection thus plays a fundamental role in the evolution of these structures and a 
fuller investigation of selection acting across a growing set of cells undergoing a BFB 
process is warranted. 



6 Conclusions 

We have highlighted some of the genomic complexities that arise from the BFB process 
that underlies the copy number profile of many amplicons observed in cancer. Although 
not every copy number profile can arise from a BFB process, the number of different 
BFB evolutions rises spectacularly quickly with the number of BFB cycles. Furthermore,
a single copy number profile may be possible from more than one BFB evolution,
complicating the inference of the correct evolution. For such degenerate cases, use of
additional in-silico methods such as [4], or experimental methods such as Fluorescent
In Situ Hybridisation (FISH), will be necessary to help identify the actual chromosomal
structure and underlying process.

This work provides some understanding of the evolution of amplicons. However, amplicons
can arise from other processes such as tandem duplication [7] or double minutes [14],
for example, and amplicon evolution in general will be somewhat more complicated,
possibly involving combinations of these processes, as well as other unexplored
mechanisms.

This analysis also assumes that the data arise from a single dominant clone, which is
not always the case [2] [TU]. All of these other factors will have to be taken into account
if we are to unravel more general evolutions of amplicons. However, the work presented
is one step in that direction.

A Appendix - Proofs



Proof. Lemma 2.1. Suppose we have a palindromic region word of the form $XYY^{-1}X^{-1}$
such that the next BFB fold occurs between $X$ and $Y$. If we duplicate the left (resp.
right) side of the fold, we get word $XX^{-1}$ (resp. $XYY^{-1}YY^{-1}X^{-1}$). We can similarly
have a breakpoint at the symmetrically opposite position between $Y^{-1}$ and $X^{-1}$.
Duplication of the left (resp. right) side of the fold then produces words $XYY^{-1}YY^{-1}X^{-1}$
(resp. $XX^{-1}$). These events cannot be distinguished as they have identical products. □
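
The symmetry used in this proof can be checked on a small example. The sketch below is illustrative only: it assumes a word is a list of (name, orientation) pairs, that inversion reverses the order and flips each orientation, and that a BFB cycle keeps one side of the break and joins it to its own inverse. The helper names are hypothetical.

```python
def inverse(word):
    """Reverse the word and flip the orientation of every symbol."""
    return [(name, -sign) for name, sign in reversed(word)]

def duplicate_left(word, cut):
    """Keep the part left of the break and join it to its inverse."""
    kept = word[:cut]
    return kept + inverse(kept)

def duplicate_right(word, cut):
    """Keep the part right of the break and join its inverse in front."""
    kept = word[cut:]
    return inverse(kept) + kept

X = [("a", 1), ("b", 1)]
Y = [("c", 1)]
word = X + Y + inverse(Y) + inverse(X)   # the palindrome X Y Y^-1 X^-1

cut = len(X)                     # break between X and Y
mirror_cut = len(word) - len(X)  # break between Y^-1 and X^-1

print(duplicate_left(word, cut) == duplicate_right(word, mirror_cut))
print(duplicate_right(word, cut) == duplicate_left(word, mirror_cut))
```

Both comparisons hold, so a break and its mirror image give the same pair of products.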



Proof. Theorem 2.1. The final product of a BFB sequence must have a fold word of
the form $X n X^{-1}$ and so we can select the middle symbol, which must be unique as
it is the last fold to form. Undoing this fold gives us the word $X$ to consider. Now
$X$ is a product of a BFB, but fold $n$ may have truncated some sequence $Z$. $X$ must
thus have the form $Z Y m Y^{-1}$ for some subwords $Y$ and $Z$ (where the $-1$ power indicates
symbols in reversed order). Now in any word generated by a BFB process, if we have
two consecutive occurrences of a symbol $m_1$ then there must have been a BFB with fold
$m_2$ that duplicated $m_1$ with fold $m_2$ between the two copies. Thus fold $m_2$ occurred
later than $m_1$ and we have a word of the form $\ldots m_1 \ldots m_2 \ldots m_1 \ldots$. Note that the leftmost
position of $m_1$ does not change position in the word when fold number $m_2$ is incorporated.
The leftmost $m_2$ is to the right of the leftmost $m_1$ and we see that the first occurrences of
fold symbols reflect their evolutionary order. If there is more than one symbol $m_2$ we can
repeat the procedure, forming series $m_1, m_2, \ldots$ until we find a symbol $m_n$ occurring once.
This must exist because each fold symbol $m_i$ is located further into the word than $m_{i-1}$,
and the word is finite in length. There may be more than one symbol occurring once
(resulting in a word of the form $X m_n m_{n+1} \ldots m_{n+u}$ for unique symbols $m_n, \ldots, m_{n+u}$).
The rightmost symbol $m_{n+u}$ must then be the latest event that we undo in STEP 2.
Because the word is reduced in size at each step we either obtain a valid BFB evolution
or the algorithm fails and the word is not a viable representation of a BFB process. □
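
Two properties used in this proof, that the latest fold is the unique middle symbol and that first occurrences of fold symbols appear in evolutionary order, can be checked by simulation. The sketch below is a simplified model and not the algorithm of Theorem 2.1: it assumes each cycle keeps a prefix of the current fold word, chosen at random, and joins it to the new fold symbol and the reversed prefix.

```python
import random

def bfb_cycle(word, cut, fold):
    """Keep word[:cut], then append the new fold symbol and the reversed kept part."""
    kept = word[:cut]
    return kept + [fold] + kept[::-1]

def simulate_fold_word(n_cycles, rng):
    """Build a fold word from n_cycles BFB cycles with random break positions."""
    word = []
    for fold in range(1, n_cycles + 1):
        cut = rng.randint(0, len(word))  # hypothetical: uniform break position
        word = bfb_cycle(word, cut, fold)
    return word

def first_occurrences(word):
    """List the fold symbols in the order in which they first appear."""
    seen = []
    for symbol in word:
        if symbol not in seen:
            seen.append(symbol)
    return seen

word = simulate_fold_word(6, random.Random(3))
print(word, first_occurrences(word))
```

For any seed the simulated word is a palindrome, the last fold appears exactly once at its centre, and the first occurrences are increasing.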



Proof. Theorem 3.1. We show that $s_n$ counts positively labeled segments of the $n$th
BFB structure inductively. This is true for $n = 1$ because we start with $s_1 = r_1 = 1$
segment with label 1. Assume true for $n = k$ so that we have $s_k > 0$ positive values
labeling the segments after the midpoint, and so $2s_k$ labels in total. Then we can select
label $r_{k+1}$ for the next BFB with $-s_k < r_{k+1} \le s_k$. This means that if $r_{k+1} > 0$ we
duplicate $s_k$ non-positive and $r_{k+1}$ positive labels, producing $s_k + r_{k+1} = s_{k+1} > 0$
new positive labels, as required. If $r_{k+1} \le 0$ then $s_k - (-r_{k+1})$ counts the number of
(negative) labels that are duplicated, again producing $s_k + r_{k+1} = s_{k+1} > 0$ new positive
labels. Thus $s_n$ always counts the number of positively labeled segments.

We then find that the structure formed by the $n$th BFB cycle has $2s_n$ segments,
$s_n$ with positive labels and $s_n$ with non-positive labels, separated by the structure's
midpoint, which is the position of the fold formed by the $n$th BFB cycle. Consequently,
$s_n$ counts the number of segments we traverse through the structure until we encounter
this fold, as required.

Now, the $(k-1)$th and $k$th BFB folds occur on the $s_{k-1}$th and $s_k$th segments. Now
if $r_k = s_k - s_{k-1} \le 0$, fold $k-1$ is positioned further into the structure than fold
$k$, and so is deleted. Then the $(k-1)$th fold can be removed and cumulative sequence
$[\ldots, s_{k-2}, s_{k-1}, s_k, \ldots]$ can be replaced with $[\ldots, s_{k-2}, s_k, \ldots]$. Taking differences
between consecutive terms, BFB sequence $[\ldots, r_{k-2}, r_{k-1}, r_k, \ldots]$ then becomes
$[\ldots, r_{k-2}, s_k - s_{k-2}, \ldots] = [\ldots, r_{k-2}, (s_k - s_{k-1}) + (s_{k-1} - s_{k-2}), \ldots] = [\ldots, r_{k-2}, r_{k-1} + r_k, \ldots]$;
one term has been absorbed and one BFB has been removed.

We have seen that the $k$th BFB fold is located on the $s_k$th segment. The first fold of
this structure points in direction $-p$, and the direction alternates with segments. We
thus find that the $k$th fold points in direction $p(-1)^{s_k}$, as required. □
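
The absorption step in this proof can be sketched in Python. The sketch assumes that every non-positive term is merged into its predecessor, repeatedly if the merged term is itself non-positive, so that the cumulative sum is preserved. The function name and the left-to-right processing order are illustrative choices.

```python
def reduce_bfb_sequence(r):
    """Absorb each non-positive term into the term before it.

    The sum of the sequence (the final s_n) is unchanged, and every
    absorbed term corresponds to one BFB fold that has been deleted.
    """
    reduced = []
    for term in r:
        reduced.append(term)
        # Merge while the latest term is non-positive and has a predecessor.
        while len(reduced) > 1 and reduced[-1] <= 0:
            last = reduced.pop()
            reduced[-1] += last
    return reduced

full = [1, 2, -1, 3]              # cumulative sums 1, 3, 2, 5
print(reduce_bfb_sequence(full))  # cumulative sums 1, 2, 5
```

In this example the fold created at cumulative position 3 is removed by the later fold at position 2, leaving three surviving folds.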



Proof. Lemma 3.1. The fold word symbols indicate the order of folds occurring in the
structure. From Theorem 3.1, if $r_n$ is the next value in the BFB sequence, the segment
this fold occurs on is $s_n$ from the end. This stretches between values $x_{w_{n-1}(s_n)}$ and
$x_{w_{n-1}(s_n+1)}$. The value $s_n$ counts the segments from the start of the structure, which
alternate in direction, so by Theorem 3.1, $d_n$ indicates which of $x_{w_{n-1}(s_n)}$, $x_{w_{n-1}(s_n+1)}$
is larger, giving the inequalities specified. □



Proof. Lemma 3.2. We have two cases to consider. For case I, if node n formed from
a and b (> a) is plain, then we have word ..ab.. becoming ..ana... When we then connect
a new node n' we have ..ana.. becoming either ..an'a.. or ..ann'na... In either case we
construct a major edge from n to n', and a minor edge from a to n'. For case II, if
node n formed from a and b is flipped, then we have word ..ba.. becoming ..bnb... When
we then connect a new node n' we have ..bnb.. becoming either ..bn'b.. or ..bnn'nb... In
either case we construct a major edge from n to n', and a minor edge from b to n'. □



Proof. Lemma 3.4. Here we let $v_{n,m}$ represent the number of reduced BFB sequences
such that $s_n = \sum_{k=1}^{n} r_k = m$. $v_{1,m}$ is thus zero apart from the first unit entry
$v_{1,1} = 1$. By the definition in Theorem 3.1 of reduced BFB sequences, each sequence
$[r_1, r_2, \ldots, r_n]$ can have a subsequent positive entry $r_{n+1} \in \{1, 2, \ldots, m\}$. Thus,
conversely, a sequence $[r_1, r_2, \ldots, r_{n+1}]$ of length $n + 1$ and total $m$ must contain the
sub-sequence $[r_1, r_2, \ldots, r_n]$ with a total ranging from $\lfloor (m+1)/2 \rfloor$ to $m - 1$. This
gives us the relation

$$v_{n+1,m} = \sum_{k=\lfloor (m+1)/2 \rfloor}^{m-1} v_{n,k}.$$

Summing $v_{n,m}$ over second index $m$ then counts the number of representative sequences of
length $n$.

Similarly, we let $w_{n,m}$ represent the number of full representative sequences
$[r_1, r_2, \ldots, r_n]$ such that $s_n = \sum_{k=1}^{n} r_k = m$. $w_{1,m}$ is thus zero apart from the
first unit entry $w_{1,1} = 1$. By the definition in Theorem 3.1 of full BFB sequences, each
such sequence can then have a subsequent entry
$r_{n+1} \in \{-(m-1), -(m-2), \ldots, 0, 1, 2, \ldots, m\}$. Thus, conversely, a sequence
$[r_1, r_2, \ldots, r_{n+1}]$ of length $n + 1$ and total $m$ must contain the sub-sequence
$[r_1, r_2, \ldots, r_n]$ with a total ranging from $\lfloor (m+1)/2 \rfloor$ to $\infty$. This gives us the
relation

$$w_{n+1,m} = \sum_{k=\lfloor (m+1)/2 \rfloor}^{\infty} w_{n,k}.$$

Summing $w_{n,m}$ over second index $m$ then counts the number of full representative
sequences of length $n$. □
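
The two recurrences can be checked against direct enumeration. The sketch below assumes the entry ranges stated in the proof (positive entries up to the current total for reduced sequences, and entries from $-(m-1)$ to $m$ for full sequences); the function names are illustrative.

```python
def count_by_recurrence(n, reduced):
    """Count sequences of length n by iterating the recurrence over totals m."""
    counts = {1: 1}  # v_{1,m} or w_{1,m}: only the unit entry at m = 1
    for _ in range(n - 1):
        largest = max(counts)
        nxt = {}
        for m in range(1, 2 * largest + 1):
            lo = (m + 1) // 2
            hi = m - 1 if reduced else largest  # counts vanish above `largest`
            total = sum(counts.get(k, 0) for k in range(lo, hi + 1))
            if total:
                nxt[m] = total
        counts = nxt
    return sum(counts.values())

def count_by_enumeration(n, reduced):
    """Count the same sequences by extending them one entry at a time."""
    def extend(total, remaining):
        if remaining == 0:
            return 1
        lowest = 1 if reduced else -(total - 1)
        return sum(extend(total + r, remaining - 1)
                   for r in range(lowest, total + 1))
    return extend(1, n - 1)

for n in range(1, 6):
    print(n, count_by_recurrence(n, True), count_by_recurrence(n, False))
```

Under these assumptions the recurrence and the enumeration agree for every length tested.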



Proof. Lemma 3.5. We have three cases to consider.

For SS move I, suppose we have word ..ba.. (b > a) becoming ..bmb... Then we have
a minor flipped edge from a to m and major flipped edge from b to m. To introduce first
fold 1 adjacent to m, we must start with a word of the form ..b1a... Node b is adjacent
to fold number 1 and is part of subtree s. We then find that ..b1a.. becomes either
..b1m1b.. or ..bmb... From Lemma 3.2, in the former case we find node m has a major
edge connected to a, that is we have switched the major for minor edge and performed
an ES operation. In the latter case we get the same result as before and no changes are
made to major/minor status. Because m is not adjacent to fold number 1, edge bm is
not part of the subtree.

For SS move II, suppose we have word ..ab.. (b > a) becoming ..ama... Then we have
a minor plain edge from a to m and major plain edge from b to m. To introduce first
fold 1 adjacent to m, we must start with a word of the form ..a1b... Node b is adjacent
to fold number 1 and is part of the subtree. We then find that ..a1b.. becomes either
..a1m1a.. or ..ama... From Lemma 3.2, in the former case we find node m has a major
edge connected to b, that is we have the same major edge. Because fold 1 is adjacent to
m, edge bm is in the subtree. In the latter case node m has major edge connected to a,
that is we have switched major for minor edge. Because m is not adjacent to 1, edge bm
is not in the subtree. That is, when the plain edge is adjacent to the subtree we switch,
otherwise we leave alone.

The remaining edges have unmodified evolution and the order tree is unchanged.
These are all the moves required for SS operations on a subtree. □



Proof. Lemma 3.6. We prove the result by induction.

For the tree with two nodes (n = 1), we have two subtrees, both of which result in
$b_s = 1$ with an unchanged tree under SS operations, so
$\sum_{\{s \in S(T) : b_s = 1\}} O(T_s) = 2O(T)$ as required. We now assume the result is true
for all such trees with n nodes (or fewer) below the root. Now consider a tree with n + 1
nodes below the root, such as Figure 11Ai.

First consider the case $b_s = n + 1$. There are two subtrees that leave the tree
unchanged under SS operations. First is the empty subtree. The second subtree consists
of the component of the tree composed of plain edges attached to the root. SS operations
then leave the tree fixed and the order is the same. All other subtrees will send at least
one edge to the root and reduce $b_s$. This gives us
$\sum_{\{s \in S(T) : b_s = n+1\}} O(T_s) = 2O(T)$.

Figure 11: In Ai we have a rooted tree with two branches. The bold edges indicate
a subtree. In Bi and Ci we have the two branches as distinct trees along with their
induced subtrees. In ii we have the corresponding trees after the ER operations. The
node counts of the node b below each root are indicated.

Now consider the case that $b_s = r + 1 < n + 1$. There can be two types of edges
descending from the node below the root (node labelled b in Figure 11). We can have a
flipped edge, such as node c, or a plain edge such as node d. By Lemma 3.2 any children
of flipped nodes c must have minor edges connected to node b below the root. No minor
edges can touch the root. Conversely, any children of plain nodes such as d must have
minor edge attached to the root a, but not node b below.

Suppose we have $i = 1, 2, \ldots, I$ indexing flipped nodes $c_i$ adjacent to b and
$j = 1, 2, \ldots, J$ indexing plain nodes $d_j$ adjacent to b. Any subtree s of T is going to result in a
modified order tree such as Figure 11A-ii following SS operations. We restrict attention
to subtrees that result in a tree $T_s$ such that node b has value $b_s = r + 1$. Prior to SS
operations flipped node $c_i$ has value $n_i$. After SS operations this becomes $n_i - r_i$, for
some value $r_i$, with $r_i$ nodes now descending from node b. Prior to SS operations plain
node $d_j$ has value $n'_j$. After SS operations this becomes some value $r'_j$ with $n'_j - r'_j$ nodes
now descending from b. If $\sum_i r_i + \sum_j r'_j = r$, node b then has value $r + 1$.

Now the subtree s can be split into subtrees $s_i$ and $s'_j$ that pass through nodes $c_i$
and $d_j$. We also subdivide tree T according to nodes $c_i$ and $d_j$ into $T_i$ and $T'_j$ as follows.
For flipped nodes we take nodes $c_i$, their descendants, node b, and all minors attached to
these nodes (such as Figure 11Bi). For plain nodes we take nodes $d_j$, their descendants,
node a, and all minors attached to these nodes (such as Figure 11Ci). The subtrees $s_i$
and $s'_j$ can then induce SS operations on corresponding trees $T_i$ and $T'_j$.

It is convenient to define ratios $R(T_s) = \frac{O(T_s)}{O(T)} = \frac{\prod_k m_k}{\prod_k m_{k,s}}$, where $m_k$ and $m_{k,s}$ denote
the node values of node k, as described in Lemma 3.3, before and after the SS operations,
respectively. Note that because the root node value (the total number of nodes in the
tree) is unaffected by SS operations, $m_{root}$ and $m_{root,s}$ cancel and do not contribute to
$R(T_s)$.

We then find that, 



E «Pi)= E E - E 

{sES(T):b s =r+l} { ri +...+r R =r} {si6S(T ljSl ):6 a =ni-ri} {sieS(r /iS/ ):6 s =n r -r I } 

e - e n^..,)n^) 1+E -;^ E '" ; 

Ke5(^ s ,):6 s =r' 1 } {s'jeS(T' Ja , ):b s =r'j} i=l 3=1 



where we get a product of terms $R(T_s)$ from each subtree, along with the last term
corresponding to node b. Because trees $T_i$ and $T'_j$ have fewer than n + 1 nodes we can use
the inductive hypothesis and this sum becomes,



3 

{n+...+r H =r} {si6S(Ti):6 s =ni-n} {s J 6Sf(T/):6 s =n/-r I } 

E R( T i,sO- E 



'.7 ' 



l,s 1 J,Sj 

{ri + ...+r fl =r} i=l 3=1 { ri +...+r H =r} 8=1 3=1 



- ; U r — "-T+l 

1 + r 



Then substituting $R(T_s) = \frac{O(T_s)}{O(T)}$ gives the required result. □
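
The ratio of node-value products used in this proof can be illustrated on a toy tree. The sketch below rests on two assumptions that are not confirmed by this section: that a node value is the number of nodes in the subtree rooted at that node (as the remark about the root value suggests), and that the order of a tree counts the orderings of its nodes in which every parent precedes its children. Under those assumptions the ratio of orderings equals the inverse ratio of node-value products. The example trees are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def node_values(parent):
    """Subtree sizes for a tree given as {node: parent}, root mapped to None."""
    sizes = {v: 1 for v in parent}
    for v in parent:
        p = parent[v]
        while p is not None:  # add v to the count of each ancestor
            sizes[p] += 1
            p = parent[p]
    return sizes

def value_ratio(parent_before, parent_after):
    """Product of node values before, divided by the product after."""
    before, after = node_values(parent_before), node_values(parent_after)
    num = den = 1
    for k in before:
        num *= before[k]
        den *= after[k]
    return Fraction(num, den)

def count_orderings(parent):
    """Brute-force count of node orderings with each parent before its children."""
    nodes = list(parent)
    total = 0
    for perm in permutations(nodes):
        pos = {v: i for i, v in enumerate(perm)}
        if all(parent[v] is None or pos[parent[v]] < pos[v] for v in nodes):
            total += 1
    return total

before = {"a": None, "b": "a", "c": "b", "d": "b"}
after = {"a": None, "b": "a", "c": "b", "d": "a"}   # node d moved up to the root
print(value_ratio(before, after), count_orderings(before), count_orderings(after))
```

Moving node d to the root lowers the value of node b from 3 to 2, and the number of orderings rises from 2 to 3, matching the ratio 3/2.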



Proof. Theorem 4.1. We abuse notation throughout and equate random variables
with their values. The required result can be demonstrated with induction. The initial
distribution ($n = 1$) of $L_1$ is the uniform distribution $U([0, 2L])$, reflecting the
uniform choice of the first breakpoint in $[0, L]$ prior to duplication, in agreement with
the formula for $P(L_1)$. At each step of the BFB process we pick a breakpoint from
$U([0, L_{n-1}])$ and double the length of the retained piece. We thus have
$P(L_n | L_{n-1}) = \frac{1}{2L_{n-1}}$, $0 < L_n < 2L_{n-1}$. Then assuming the form for $n - 1$ we find that

$$P(L_n) = \int_{L_n/2}^{2^{n-1}L} P(L_n | L_{n-1}) P(L_{n-1}) \, dL_{n-1}.$$

An integration by parts then gives the desired form.

Now $(L_n | L_{n-1}) \sim U([0, 2L_{n-1}])$ gives us the martingale property that
$E_{L_n | L_{n-1}}(L_n) = L_{n-1}$, thus the initial value $E(L_1) = L$ tells us that the mean
length of the BFB segment is $L$.

The variance $\int L_n^2 P(L_n) \, dL_n - L^2$ follows from an integration by parts. □
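
The martingale property can be checked with a short Monte Carlo sketch. It assumes the model stated in the proof, $L_n = 2U$ with $U$ uniform on $[0, L_{n-1}]$. The comparison value $(4/3)^n L^2$ for the second moment is not quoted from the theorem; it follows from $E(L_n^2 | L_{n-1}) = (2L_{n-1})^2/3$ under that uniform law. Function names and sample sizes are illustrative.

```python
import random

def simulate_lengths(n_cycles, L, n_samples, rng):
    """Sample the structure length after n_cycles BFB cycles."""
    finals = []
    for _ in range(n_samples):
        length = L
        for _ in range(n_cycles):
            # Break uniformly along the current length, then double the kept piece.
            length = 2.0 * rng.uniform(0.0, length)
        finals.append(length)
    return finals

def mean(values):
    return sum(values) / len(values)

samples = simulate_lengths(5, 1.0, 200_000, random.Random(1))
print(mean(samples))                          # should stay close to L = 1
print(mean([v * v for v in samples]))         # compare with (4/3)**5
```

The sample mean stays near $L$ while the second moment grows with the number of cycles, so the distribution spreads even though its mean is fixed.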



Proof. Theorem 4.2. This can be shown inductively. Initially, if the length of the
structure produced by the $m$th BFB cycle is $L_m$, then clearly the fold at the midpoint
is a distance $L_m/2$ from either end after the $m$th event. We then assume that all copies of
the $m$th BFB fold are at least a distance $L_m/2$ from either end of the structure prior to the
$n$th BFB (so $n > m$). One of two things can happen. Either the $n$th fold is nearer to the
ends than $L_m/2$ (so $L_n < L_m$), in which case all the copies of the $m$th BFBs are deleted,
or $L_n > L_m$ and some of them are duplicated, including a BFB nearest to the end, so
the smallest distance $L_m/2$ is preserved, as required. □



Proof. Theorem 4.3. We first establish the formula for $W_k(x, y)$ by induction. We have
initial value $W_1(x, y) = \int_x^{2y} \frac{dz}{z} = \log(2y) - \log(x)$ which matches the result. We assume
true for $m$. Now we have

$$W_{m+1}(x, y) = \int_x^{2y} \frac{W_m(x, z)}{z} \, dz = \sum_{j=0}^{m} a^m_j(x) \int_x^{2y} \frac{\log^j(2^m z)}{z} \, dz.$$

Integration by parts gives $\int_x^{2y} \frac{\log^j(2^m z)}{z} \, dz = \frac{1}{j+1}\left(\log^{j+1}(2^{m+1} y) - \log^{j+1}(2^m x)\right)$ so that

$$W_{m+1}(x, y) = \sum_{j=0}^{m} a^m_j(x) \frac{1}{j+1}\left(\log^{j+1}(2^{m+1} y) - \log^{j+1}(2^m x)\right).$$

Thus we find $a^{m+1}_0(x) = -\sum_{j=0}^{m} \frac{1}{j+1} a^m_j(x) \log^{j+1}(2^m x)$ and
$a^{m+1}_j(x) = \frac{1}{j} a^m_{j-1}(x)$, for $j \ge 1$, so that $a^{m+1} = B^{m+1} a^m$,
as required.
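
The coefficient recurrence can be checked numerically. The sketch below assumes the representation $W_m(x, y) = \sum_{j=0}^{m} a^m_j(x) \log^j(2^m y)$ with $a^1 = (-\log x, 1)$ and the convention $W_0 = 1$, and compares it with a midpoint-rule evaluation of $W_m(x, y) = \int_x^{2y} W_{m-1}(x, z)/z \, dz$. The function names and the quadrature are illustrative choices.

```python
import math

def next_coefficients(a, m, x):
    """Map the coefficients a^m_0..a^m_m to a^{m+1}_0..a^{m+1}_{m+1}."""
    log_term = math.log(2 ** m * x)
    a0 = -sum(a[j] * log_term ** (j + 1) / (j + 1) for j in range(m + 1))
    return [a0] + [a[j - 1] / j for j in range(1, m + 2)]

def W(k, x, y):
    """Evaluate W_k(x, y) from the coefficient recurrence (W_0 = 1 by convention)."""
    if k == 0:
        return 1.0
    a = [-math.log(x), 1.0]  # W_1(x, y) = log(2y) - log(x)
    for m in range(1, k):
        a = next_coefficients(a, m, x)
    return sum(a[j] * math.log(2 ** k * y) ** j for j in range(k + 1))

def W_by_quadrature(k, x, y, steps=20_000):
    """Midpoint-rule estimate of the integral of W_{k-1}(x, z)/z over [x, 2y]."""
    h = (2.0 * y - x) / steps
    total = 0.0
    for i in range(steps):
        z = x + (i + 0.5) * h
        total += W(k - 1, x, z) / z
    return total * h

x, y = 0.3, 1.0
for k in (1, 2, 3):
    print(k, W(k, x, y), W_by_quadrature(k, x, y))
```

The two columns agree to within the quadrature error, which supports the recurrence under the stated conventions.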

Now if we have length $L_{n-1}$ prior to the $n$th BFB, then assuming the fold occurs
uniformly along the length, the BFB duplication results in $P(L_n | L_{n-1}) = \frac{1}{2L_{n-1}}$
with $0 < L_n < 2L_{n-1}$. The length sequence $L_n$ is also Markovian. Thus, with $L_0 = L$, we can
write

$$P(L_1, L_2, \ldots, L_n) = \prod_{k=1}^{n} P(L_k | L_{k-1}) = \frac{1}{2^n L \prod_{k=1}^{n-1} L_k},$$

where $L_n < 2L_{n-1} < \cdots < 2^{n-1} L_1 < 2^n L$.

The probability that the $k$th of $N$ BFBs has minimum length $L_k = x$ is given by
$M_{k,N}(x, L) = \Pr(L_1 > x, \ldots, L_{k-1} > x, L_k = x, L_{k+1} > x, \ldots, L_N > x)
= \Pr(L_1, L_2, \ldots, L_{k-1} > x, L_k = x) \Pr(L_{k+1}, \ldots, L_N > x | L_k = x)$,
where we have used the Markovian property of the length sequence.

The first term can be obtained by integrating the above density,

$$\Pr(L_1, L_2, \ldots, L_{k-1} > x, L_k = x) = \int_x^{2L} \int_x^{2L_1} \cdots \int_x^{2L_{k-2}} \frac{1}{2^k L \, L_1 \cdots L_{k-1}} \, dL_{k-1} \cdots dL_1 = \frac{1}{2^k L} W_{k-1}(x, L).$$

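The second term we similarly find as,

$$\Pr(L_{k+1}, L_{k+2}, \ldots, L_N > x | L_k = x) = \frac{1}{2^{N-k}} \int_x^{2x} \int_x^{2L_{k+1}} \cdots \int_x^{2L_{N-1}} \frac{1}{x \, L_{k+1} \cdots L_{N-1}} \, dL_N \cdots dL_{k+1}$$

$$= \frac{1}{2^{N-k-1}} \int_x^{2x} \cdots \int_x^{2L_{N-2}} \frac{1}{x \, L_{k+1} \cdots L_{N-2}} \, dL_{N-1} \cdots dL_{k+1} - \frac{1}{2^{N-k}} \int_x^{2x} \cdots \int_x^{2L_{N-2}} \frac{1}{L_{k+1} \cdots L_{N-1}} \, dL_{N-1} \cdots dL_{k+1}$$

$$= 1 - \sum_{l=1}^{N-k} \frac{W_{l-1}(x, x)}{2^l},$$

where the innermost integral over $L_N$ contributes the factor $2L_{N-1} - x$, the split is
applied recursively to the first integral, and $W_0 = 1$.
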

Putting these two terms together gives the required form. 
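
The conditional survival probability $\Pr(L_{k+1}, \ldots, L_N > x | L_k = x)$ can be estimated by simulation for small numbers of further cycles. The sketch below assumes each new length is uniform on $[0, 2 \times \text{previous length}]$, as in the proof. The comparison values are obtained by integrating those uniform laws directly: $1/2$ for one further cycle and $1/2 - \log(2)/4$ for two. The function name and sample size are illustrative.

```python
import math
import random

def survival_fraction(x, extra_cycles, n_samples, rng):
    """Fraction of runs in which every later length stays above x."""
    survived = 0
    for _ in range(n_samples):
        length = x
        alive = True
        for _ in range(extra_cycles):
            length = rng.uniform(0.0, 2.0 * length)
            if length <= x:
                alive = False
                break
        survived += alive
    return survived / n_samples

rng = random.Random(7)
print(survival_fraction(1.0, 1, 200_000, rng), 0.5)
print(survival_fraction(1.0, 2, 200_000, rng), 0.5 - math.log(2) / 4)
```

The estimates do not depend on the value of $x$, since only ratios of lengths enter the conditional laws.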



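Proof. Corollary 4.1. The required integral can be recursively split as follows:

$$\int_{L_i}^{2L_{i-1}} \int_{L_i}^{2l_1} \cdots \int_{L_i}^{2l_{d-1}} \frac{1}{2^d L_{i-1} \, l_1 \cdots l_{d-1}} \, dl_d \cdots dl_1$$

$$= \frac{1}{2^{d-1} L_{i-1}} \int_{L_i}^{2L_{i-1}} \cdots \int_{L_i}^{2l_{d-2}} \frac{1}{l_1 \cdots l_{d-2}} \, dl_{d-1} \cdots dl_1 - \frac{L_i}{2^d L_{i-1}} \int_{L_i}^{2L_{i-1}} \cdots \int_{L_i}^{2l_{d-2}} \frac{1}{l_1 \cdots l_{d-1}} \, dl_{d-1} \cdots dl_1$$

$$= 1 - \frac{L_i}{2L_{i-1}} \sum_{k=0}^{d-1} \frac{W_k(L_i, L_{i-1})}{2^k},$$

where the second integral in the middle line is $W_{d-1}(L_i, L_{i-1})$ and $W_0 = 1$. □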